Alessandro Epasto
I am a senior staff research scientist at Google New York, where I serve as Tech Lead of a privacy team in the Google Research Algorithms and Optimization group led by Vahab Mirrokni.
I received a Ph.D. in computer science from Sapienza University of Rome, where I was advised by Professor Alessandro Panconesi and supported by a 2011 Google Europe Ph.D. Fellowship in Algorithms.
I was also a postdoctoral researcher in the Department of Computer Science at Brown University in Providence, RI, USA, where I was advised by Professor Eli Upfal.
During my Ph.D. studies I was twice an intern at Google Mountain View (2012, 2014) and once at Google NYC (2013).
My research interests include algorithmic problems in machine learning and data mining, in particular in the areas of privacy, clustering, and large-scale graph analysis.
Authored Publications
Differentially Private Synthetic Data Release for Topics API Outputs
Travis Dick
Josh Karlin
Adel Javanmard
Peilin Zhong
Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2025)
Recently, significant attention has been devoted to the design and analysis of Privacy-Preserving Ads APIs. Despite the interest of academics and regulators in understanding the privacy properties of such APIs, the empirical study of these methods is severely hindered by the lack of publicly available data. This is because a reliable empirical analysis of the privacy properties of an API requires access to a dataset consisting of realistic API outputs for a large collection of users. However, privacy concerns prevent the general release of such browsing data to the public.
In this work we address this problem by developing a novel methodology to construct synthetic API outputs that are simultaneously realistic enough to enable accurate study and provide strong privacy protections to the user.
We focus on one of the Privacy-Preserving Ads APIs: the Topics API, part of Google Chrome's Privacy Sandbox, which enables interest-based advertising without relying on third-party cookies. We develop a methodology to generate a differentially private dataset that is realistic enough to closely match the re-identification risk properties of the real Topics API data. The use of differential privacy prevents the leakage of private user information from this release.
Our methodology first computes a large number of differentially private statistics describing how output API traces evolve over time. We then design a parameterized distribution over sequences of API traces and optimize its parameters so that it closely matches the statistics obtained. Finally, we create the synthetic data by drawing from this distribution.
Our work is complemented with an open source release of the anonymized dataset obtained by this methodology. We hope this will enable external researchers to analyze the API in-depth and replicate prior and future work on a realistic large-scale dataset. We believe that this work will contribute to fostering transparency on the privacy properties of Privacy-Preserving Ads APIs.
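The three-stage pipeline described in the abstract (noisy statistics, model fitting, sampling) can be sketched in miniature. This is an illustrative stand-in, assuming NumPy is available: the statistic here is a simple Laplace-noised topic histogram and the fitted model is just a categorical distribution, not the paper's actual trace statistics or parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_topic_histogram(user_traces, num_topics, epsilon):
    """Stage 1: a differentially private statistic. Here, a Laplace-noised
    histogram of topics, with each user's contribution bounded first."""
    counts = np.zeros(num_topics)
    for trace in user_traces:
        for topic in set(trace):
            counts[topic] += 1
    sensitivity = max(len(set(t)) for t in user_traces)  # per-user bound
    counts += rng.laplace(scale=sensitivity / epsilon, size=num_topics)
    return np.clip(counts, 1e-9, None)

def fit_distribution(noisy_counts):
    """Stage 2: fit a parameterized model (here a categorical distribution
    over topics) so that it matches the DP statistics."""
    return noisy_counts / noisy_counts.sum()

def sample_synthetic(probs, num_users, trace_len):
    """Stage 3: draw synthetic per-user topic traces from the model."""
    return [rng.choice(len(probs), size=trace_len, p=probs).tolist()
            for _ in range(num_users)]

traces = [[0, 1], [1, 2], [1, 3], [0, 1]]
probs = fit_distribution(dp_topic_histogram(traces, num_topics=4, epsilon=1.0))
synthetic = sample_synthetic(probs, num_users=5, trace_len=3)
```

Because only the noisy statistics touch the real traces, any dataset sampled from the fitted model inherits the differential privacy guarantee by post-processing.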
Scalable Private Partition Selection via Adaptive Weighting
Justin Y. Chen
Forty-second International Conference on Machine Learning (2025)
In the differentially private partition selection problem (a.k.a. private set union, private key discovery), users hold subsets of items from an unbounded universe. The goal is to output as many items as possible from the union of the users' sets while maintaining user-level differential privacy. Solutions to this problem are a core building block for many privacy-preserving ML applications including vocabulary extraction in a private corpus, computing statistics over categorical data and learning embeddings over user-provided items.
We propose an algorithm for this problem, MaxAdaptiveDegree (MAD), which adaptively reroutes weight from items far above the threshold needed for privacy to items with smaller weight, thereby increasing the probability that less frequent items are output. Our algorithm can be efficiently implemented in massively parallel computation systems, allowing it to scale to very large datasets. We prove that our algorithm stochastically dominates the standard parallel algorithm for this problem. We also develop a two-round version, MAD2R, in which results from the first round bias the weighting in the second round to maximize the number of items output. In experiments, our algorithms provide the best results across the board among parallel algorithms and scale to datasets with hundreds of billions of items, up to three orders of magnitude larger than those analyzed by prior sequential algorithms.
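For context, the standard weight-and-threshold baseline that MAD improves on can be sketched as follows. This is a simplified stand-in: the noise scale and threshold formula below are illustrative, not the carefully calibrated constants from the literature.

```python
import math
import random
from collections import defaultdict

def basic_partition_selection(user_sets, epsilon, delta, max_items, seed=0):
    """Baseline for private partition selection: uniform weighting,
    Laplace noise, fixed threshold. (Illustrative sketch only.)"""
    rng = random.Random(seed)
    weights = defaultdict(float)
    for items in user_sets:
        chosen = sorted(items)[:max_items]      # bound each user's contribution
        for item in chosen:
            weights[item] += 1.0 / len(chosen)  # each user adds total weight 1
    # Removing one user changes the weights by at most 1 in l1 norm, so
    # Laplace(1/epsilon) noise plus a delta-dependent threshold gives
    # (epsilon, delta)-DP for the released set of items.
    threshold = 1.0 + math.log(1.0 / (2.0 * delta)) / epsilon
    released = set()
    for item, w in weights.items():
        # Laplace(1/epsilon) noise as a difference of two exponentials.
        laplace = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        if w + laplace > threshold:
            released.add(item)
    return released

users = [{"the", "a", "of"}] * 400 + [{"rare-item"}]
out = basic_partition_selection(users, epsilon=1.0, delta=1e-6, max_items=3)
```

The weakness MAD targets is visible here: heavy items like "the" accumulate weight far beyond the threshold, and all of that surplus is wasted rather than helping rarer items clear the bar.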
In modern datasets, where single records can have multiple owners, enforcing user-level differential privacy requires capping each user's total contribution. This "contribution bounding" becomes a significant combinatorial challenge. Existing sequential algorithms for this task are computationally intensive and do not scale to the massive datasets prevalent today. To address this scalability bottleneck, we propose a novel and efficient distributed algorithm. Our approach models the complex ownership structure as a hypergraph, where users are vertices and records are hyperedges. The algorithm proceeds in rounds, allowing users to propose records in parallel. A record is added to the final dataset only if all its owners unanimously agree, thereby ensuring that no user's predefined contribution limit is violated. This method aims to maximize the size of the resulting dataset for high utility while providing a practical, scalable solution for implementing user-level privacy in large, real-world systems.
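The unanimous-agreement rule is easy to illustrate with a small sketch. This is a sequential, illustrative stand-in for the parallel round-based algorithm; the function and record names are hypothetical.

```python
from collections import defaultdict

def bound_contributions(records, limit):
    """Each record is a hyperedge over its owner users. A record is kept
    only if every owner still has remaining budget -- i.e. all owners
    unanimously agree -- so no user exceeds `limit` contributions."""
    used = defaultdict(int)
    kept = []
    for record_id, owners in records:
        if all(used[u] < limit for u in owners):
            kept.append(record_id)
            for u in owners:
                used[u] += 1
    return kept

records = [
    ("r1", ["alice", "bob"]),   # kept: both owners have budget (limit=1)
    ("r2", ["alice"]),          # dropped: alice already at her limit
    ("r3", ["bob", "carol"]),   # dropped: bob already at his limit
    ("r4", ["carol"]),          # kept: carol still has budget
]
kept = bound_contributions(records, limit=1)  # -> ["r1", "r4"]
```

The distributed version replaces this sequential scan with rounds of parallel proposals, but the invariant is the same: a record enters the output only when none of its owners would exceed their cap.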
In the differentially private set union problem, users contribute sets of items as input, and the output is a subset of the union of all items. Algorithms for this problem seek to output as many items as possible while maintaining differential privacy with respect to the addition or removal of an individual user.
The basic solution to this problem maintains a weight over each item. Each user contributes uniformly to the items in their set, random noise is added to the weights, and items with noisy weight above a certain threshold are output. The only scalable (i.e., distributed) algorithms for this problem from prior work are this basic algorithm and an iterative method which repeatedly calls the basic algorithm, ignoring items found in prior invocations.
In this work, we give a weighting algorithm that improves on basic uniform weighting. Our algorithm reroutes weight from items far above the threshold to items with smaller weight, thereby increasing the probability that less frequent items are output. The algorithm is scalable and incurs no privacy loss compared to the basic algorithm. We prove that our algorithm never underperforms the basic algorithm and show experimentally that replacing the basic algorithm with ours yields the best results among scalable algorithms for the private set union problem.
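The rerouting step can be illustrated in isolation. This is a non-private, hypothetical helper: the real algorithm reroutes within each user's weight budget so that sensitivity, and hence the privacy analysis, is unchanged, whereas this sketch only shows the effect on the weight profile.

```python
def reroute_excess(weights, threshold, cap):
    """Items far above the release threshold keep only `cap` weight;
    their excess is spread uniformly over below-threshold items,
    raising the chance that less frequent items clear the threshold."""
    excess = 0.0
    out = dict(weights)
    for item, w in weights.items():
        if w > cap:
            out[item] = cap          # already safely above the threshold
            excess += w - cap        # surplus weight to redistribute
    low = [item for item, w in out.items() if w < threshold]
    if low and excess > 0:
        share = excess / len(low)
        for item in low:
            out[item] += share
    return out

weights = {"frequent": 40.0, "occasional": 0.8, "rare": 0.3}
rerouted = reroute_excess(weights, threshold=1.0, cap=4.0)
```

Note that total weight is conserved: the frequent item keeps enough to survive the noisy threshold with high probability, while its surplus lifts the rare items above it.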
Differentially Private Space-Efficient Algorithms for Counting Distinct Elements in the Turnstile Model
Rachel Cummings
Tamalika Mukherjee
Tingting Ou
Peilin Zhong
ICML 2025 (2025)
The continual observation streaming setting captures data analysis scenarios where data collection is ongoing and real-time analysis of the collected data is needed at all time points. The turnstile streaming model is a subclass of the continual observation model that allows both the insertion and deletion of data items over time. Typically, the length of the stream $T$ and the size of the universe $\mathcal{U}$ from which data items come are extremely large, so practical solutions should use space at most sublinear in $T$ and $\mathcal{U}$. Because many of the relevant applications for data analysis in the continual observation setting involve sensitive user data, there is a need for space-efficient algorithms that achieve formal privacy protections such as differential privacy.
We give the first sublinear-space differentially private algorithms for the fundamental problem of counting distinct elements in the turnstile streaming model, achieving $\tilde{O}(T^{1/3})$ space and additive error and a $(1+\eta)$-relative approximation for all $\eta \in (0,1)$. Our algorithm significantly improves upon the space requirements of the state of the art for this problem in the turnstile model (Jain et al., 2023), which has a linear dependency on both $T$ and $\vert \mathcal{U} \vert$, while still achieving an additive error close to the known $T^{1/4}$ lower bound.
Differentially Private Streaming Continual Releases of Frequency Moment Estimation
Peilin Zhong
ITCS (2023)
The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible.
Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental $\ell_p$ $(p\in [0,+\infty))$ frequency moment estimation problem under this setting, and give an $\varepsilon$-DP algorithm that achieves $(1+\eta)$-relative approximation $(\forall \eta\in(0,1))$ with $\mathrm{poly}\log(Tn)$ additive error and uses $\mathrm{poly}\log(Tn)\cdot \max(1, n^{1-2/p})$ space, where $T$ is the length of the stream and $n$ is the size of the universe of elements.
Our space is near optimal up to poly-logarithmic factors even in the non-private setting.
To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters, and counting low-frequency elements, to simpler counting/summing problems in the same setting.
Based on these primitives, we develop a differentially private continual release level set estimation approach to address the $\ell_p$ frequency moment estimation problem.
We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past $W$ data items.
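The counting/summing primitive that the reductions above bottom out in can be sketched with the classic binary-tree continual counter, which releases a DP prefix sum at every timestamp with polylogarithmic additive error. This is a standard construction shown as an illustrative stand-in for the paper's building blocks, assuming ±1 stream updates; all names are hypothetical.

```python
import math
import random

def continual_dp_counter(stream, epsilon, seed=0):
    """Binary-tree continual counter: each dyadic interval gets one
    Laplace-noised partial sum, and the prefix count at time t sums
    the O(log T) disjoint intervals covering [1, t]."""
    rng = random.Random(seed)
    T = len(stream)
    levels = max(1, math.ceil(math.log2(T + 1)) + 1)
    eps_node = epsilon / levels
    true_sum = {}
    noisy = {}

    def node_noise(node):
        # One Laplace(levels/epsilon) draw per dyadic node, reused forever,
        # sampled as a difference of two exponentials.
        if node not in noisy:
            noisy[node] = rng.expovariate(eps_node) - rng.expovariate(eps_node)
        return noisy[node]

    outputs = []
    for t, x in enumerate(stream, start=1):
        # Add x to every dyadic interval containing position t.
        for lvl in range(levels):
            size = 1 << lvl
            true_sum[(lvl, (t - 1) // size)] = \
                true_sum.get((lvl, (t - 1) // size), 0) + x
        # Decompose [1, t] into disjoint dyadic intervals and sum them.
        # Each element touches `levels` nodes, so the release is epsilon-DP.
        estimate, pos = 0.0, 0
        for lvl in reversed(range(levels)):
            size = 1 << lvl
            if t - pos >= size:
                node = (lvl, pos // size)
                estimate += true_sum.get(node, 0) + node_noise(node)
                pos += size
        outputs.append(estimate)
    return outputs

stream = [1] * 8
counts = continual_dp_counter(stream, epsilon=50.0)
```

Each timestamp's estimate aggregates at most $O(\log T)$ noisy nodes, which is where the polylogarithmic additive error in the results above comes from.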
Private estimation algorithms for stochastic block models and mixture models
Hongjie Chen
Tommaso D'Orsi
Jacob Imola
David Steurer
Stefan Tiegel
55th Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
We introduce general tools for designing efficient private estimation algorithms in high-dimensional settings whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient (ϵ,δ)-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an (ϵ,δ)-differentially private algorithm that recovers the centers of the k-mixture when the minimum separation is at least O(k^{1/t}/√t). For every choice of t, this algorithm requires sample complexity n ≥ k^O(1)d^O(t) and time complexity (nd)^O(t). Prior work required a minimum separation of at least O(√k) as well as an explicit upper bound on the Euclidean norm of the centers.
Measuring Re-identification Risk
Travis Dick
Adel Javanmard
Josh Karlin
Gabriel Henrique Nunes
Peilin Zhong
SIGMOD (2023)
Compact user representations (such as embeddings) form the backbone of personalization services. In this work, we present a new theoretical framework to measure re-identification risk in such user representations. Our framework, based on hypothesis testing, formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation. As an application, we show how our framework is general enough to model important real-world applications such as Chrome's Topics API for interest-based advertising. We complement our theoretical bounds by showing provably good attack algorithms for re-identification that we use to estimate the re-identification risk in the Topics API. We believe this work provides a rigorous and interpretable notion of re-identification risk and a framework to measure it that can be used to inform real-world applications.
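A minimal version of such an attack can make the idea concrete. This is an illustrative nearest-neighbor matcher assuming NumPy, not the paper's actual attack algorithm: it estimates risk as the fraction of noisy representations an attacker can match back to their owner.

```python
import numpy as np

def matching_attack_rate(reference, observed):
    """Match each observed (noisy) user representation to the nearest
    reference embedding; the hit rate is an empirical lower bound on
    the re-identification risk of the release."""
    hits = 0
    for i, obs in enumerate(observed):
        distances = np.linalg.norm(reference - obs, axis=1)
        hits += int(np.argmin(distances) == i)
    return hits / len(observed)

rng = np.random.default_rng(1)
reference = rng.normal(size=(50, 16))                        # one embedding per user
light_noise = reference + 0.01 * rng.normal(size=reference.shape)
heavy_noise = reference + 10.0 * rng.normal(size=reference.shape)
```

With light noise the attacker re-identifies essentially every user, while heavy noise drives the match rate toward random guessing; sweeping the noise level traces out the privacy-utility trade-off the framework quantifies.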