
Badih Ghazi
I am a Research Scientist in the Algorithms & Optimization Team at Google. Here's a link to my personal webpage.
Authored Publications
How Unique is Whose Web Browser? The role of demographics in browser fingerprinting
Pritish Kamath
Robin Lassonde
2025
Abstract
Web browser fingerprinting can be used to identify and track users across the Web, even without cookies, by collecting attributes from users' devices to create unique "fingerprints". This technique and the resulting privacy risks have been studied for over a decade. Yet further research has been limited because prior studies did not openly publish their data; moreover, the data in those studies was biased and lacked user demographics.
Here we publish a first-of-its-kind open dataset that includes browser attributes along with users' demographics, collected from 8,400 US study participants with their informed consent. As part of the data collection, we also conducted an experiment, with survey responses from a total of 12,461 participants, to study what affects users' likelihood of sharing browser data for open research and to inform future data collection efforts. Female participants were significantly less likely to share their browser data, as were participants who were shown the browser data we asked to collect.
In addition, we demonstrate how fingerprinting risks differ across demographic groups. For example, we find that lower-income users are more at risk, and that as users' age increases, they are both more likely to be concerned about fingerprinting and more likely to be at real risk of it. Furthermore, we demonstrate an overlooked risk: user demographics, such as gender, age, income level, ethnicity, and race, can be inferred from browser attributes commonly used for fingerprinting, and we identify which browser attributes contribute most to this risk.
Overall, we show the important role of user demographics in ongoing efforts to assess fingerprinting risks and improve user privacy, with findings that can inform future privacy-enhancing browser development. The dataset and data collection tool we openly publish can be used to study further research questions not addressed in this work.
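To make the notion of a unique fingerprint concrete, the following minimal Python sketch (a hypothetical illustration, not from the paper; the attribute names and hashing scheme are assumptions) derives a fingerprint by hashing browser attributes and computes the size of each user's anonymity set, i.e., how many records share that fingerprint:

    import hashlib
    from collections import Counter

    def fingerprint(attrs):
        # Serialize attributes in a fixed key order, then hash.
        canonical = "|".join("%s=%s" % (k, attrs[k]) for k in sorted(attrs))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def anonymity_set_sizes(records):
        # For each record, count how many records share its fingerprint;
        # a size of 1 means the browser is uniquely identifiable.
        counts = Counter(fingerprint(r) for r in records)
        return [counts[fingerprint(r)] for r in records]

    users = [
        {"user_agent": "UA-1", "screen": "1920x1080", "timezone": "UTC-5"},
        {"user_agent": "UA-1", "screen": "1920x1080", "timezone": "UTC-5"},
        {"user_agent": "UA-2", "screen": "1366x768", "timezone": "UTC-8"},
    ]
    print(anonymity_set_sizes(users))  # [2, 2, 1]: the third user stands out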
Leveraging Bias-Variance Trade-offs for Regression with Label Differential Privacy
Ashwinkumar Badanidiyuru Varadaraja
Avinash Varadarajan
Chiyuan Zhang
Ethan Leeman
Pritish Kamath
NeurIPS 2023 (2023)
Abstract
We propose a new family of label randomization mechanisms for the task of training regression models under the constraint of label differential privacy (DP). In particular, we leverage the trade-offs between bias and variance to construct better noising mechanisms depending on a privately estimated prior distribution over the labels. We demonstrate that these mechanisms achieve state-of-the-art privacy-accuracy trade-offs on several datasets, highlighting the importance of bias-reducing constraints when training neural networks with label DP. We also provide theoretical results shedding light on the structural properties of the optimal bias-reduced mechanisms.
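For readers unfamiliar with label DP, the sketch below shows the vanilla baseline this line of work improves on: each bounded training label is privatized independently with the Laplace mechanism. This is only a hedged illustration; the paper's mechanisms instead shape the randomization around a privately estimated prior over labels, and all names and parameters here are assumptions:

    import numpy as np

    def laplace_label_mechanism(labels, epsilon, lo=0.0, hi=1.0):
        # Clip each label to [lo, hi] and add Laplace noise calibrated to the
        # range, making each released label epsilon-label-DP. This noise is
        # unbiased but high-variance; the paper instead trades bias against
        # variance using a prior-dependent mechanism.
        sensitivity = hi - lo
        clipped = np.clip(np.asarray(labels, dtype=float), lo, hi)
        return clipped + np.random.laplace(scale=sensitivity / epsilon,
                                           size=clipped.shape)

    print(laplace_label_mechanism([0.1, 0.5, 0.9], epsilon=2.0))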
Differentially Private Heatmaps
Kai Kohlhoff
2023
Abstract
We consider the task of producing heatmaps from users' aggregated data while protecting their privacy. We give a differentially private algorithm for this task and demonstrate its advantages over previous algorithms on several real-world datasets.
Our core algorithmic primitive is a differentially private procedure that takes in a set of distributions and produces an output that is close in Earth Mover's Distance (EMD) to the average of the inputs. We prove theoretical bounds on the error of our algorithm under a certain sparsity assumption, and show that these bounds are essentially optimal.
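For intuition about the primitive, here is the obvious strawman it should be compared against: average the input distributions on a fixed grid, add Laplace noise, and project back onto the probability simplex. This is a hedged sketch, not the paper's EMD-aware algorithm; the function name and the substitution-neighbor sensitivity bound of 2/n are assumptions of the illustration:

    import numpy as np

    def naive_dp_average(distributions, epsilon):
        # Each of the n users contributes one distribution over the grid, so
        # replacing one user changes the average by at most 2/n in L1 norm;
        # the vector Laplace mechanism with scale 2/(n * epsilon) is then
        # epsilon-DP. Clipping and renormalizing is post-processing.
        dists = np.asarray(distributions, dtype=float)
        n = dists.shape[0]
        noisy = dists.mean(axis=0) + np.random.laplace(
            scale=2.0 / (n * epsilon), size=dists.shape[1:])
        noisy = np.clip(noisy, 0.0, None)
        return noisy / noisy.sum()

    grid_dists = [np.random.dirichlet(np.ones(16)) for _ in range(100)]
    print(naive_dp_average(grid_dists, epsilon=1.0).round(3))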
Abstract
Differential privacy is often applied with a privacy parameter that is larger than the theory suggests is ideal; various informal justifications for tolerating large privacy parameters have been proposed.
In this work, we consider partial differential privacy (DP), which allows quantifying the privacy guarantee on a per-attribute basis.
In this framework, we study several basic data analysis and learning tasks, and design algorithms whose per-attribute privacy parameter is smaller than the best possible privacy parameter for the entire record of a person (i.e., all the attributes).
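A toy example of the per-attribute accounting gap (an illustration under assumptions, not one of the paper's algorithms): release the mean of each of d attributes with independent Laplace noise. Changing one attribute of one record moves only that attribute's mean, so each attribute enjoys a small privacy parameter, while the whole record pays d times as much by composition:

    import numpy as np

    def per_attribute_means(data, eps_attr):
        # data: n x d matrix with entries in [0, 1]. Changing one attribute of
        # one record shifts only that column's mean, by at most 1/n, so each
        # attribute is eps_attr-DP; changing an entire record affects all d
        # columns, costing d * eps_attr by basic composition.
        n, d = data.shape
        return data.mean(axis=0) + np.random.laplace(
            scale=1.0 / (n * eps_attr), size=d)

    data = np.random.rand(1000, 5)
    print(per_attribute_means(data, eps_attr=0.2))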
Differentially Private All-Pairs Shortest Path Distances: Improved Algorithms and Lower Bounds
Jelani Osei Nelson
Justin Y. Chen
Shyam Narayanan
Yinzhan Xu
SODA 2023 (to appear)
Abstract
We study the problem of releasing the weights of all-pairs shortest paths in a weighted undirected graph with differential privacy (DP). In this setting, the underlying graph is fixed and two graphs are neighbors if their edge weights differ by at most 1 in the ℓ1-distance. We give an algorithm with additive error Õ(n^{2/3}/ε) in the ε-DP case and an algorithm with additive error Õ(√n/ε) in the (ε, δ)-DP case, where n denotes the number of vertices. This positively answers a question of Sealfon [Sea16, Sea20], who asked whether an o(n)-error algorithm exists. We also show that an additive error of Ω(n^{1/6}) is necessary for any sufficiently small ε, δ > 0.
Furthermore, we show that if the graph is promised to have reasonably bounded weights, one can improve the error further to roughly n^{(√17−3)/2+o(1)}/ε in the ε-DP case and roughly n^{√2−1+o(1)}/ε in the (ε, δ)-DP case. Previously, it was only known how to obtain Õ(n^{2/3}/ε^{1/3}) additive error in the ε-DP case and Õ(√n/ε) additive error in the (ε, δ)-DP case for bounded-weight graphs [Sea16].
Finally, we consider a relaxation where a multiplicative approximation is allowed. We show that, with a multiplicative approximation factor k, the additive error can be reduced to Õ(n^{1/2+O(1/k)}/ε) in the ε-DP case and Õ(n^{1/3+O(1/k)}/ε) in the (ε, δ)-DP case.
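For context, the input-perturbation baseline attributed to Sealfon can be sketched in a few lines: noise every edge weight once, then answer all queries from the noised graph, since post-processing consumes no further privacy budget. A k-edge path accumulates about k/ε noise, hence up to roughly n/ε additive error, which is what the results above improve. The sketch below is a hedged illustration (clipping noised weights at zero, and noising each direction of an undirected edge independently, are simplifications so Dijkstra stays valid):

    import heapq
    import numpy as np

    def dp_apsp_baseline(adj, epsilon):
        # Noise each edge weight with Laplace(1/epsilon); everything computed
        # from the noised graph afterwards is post-processing and is free.
        # adj maps u -> list of (v, weight).
        noisy = {u: [(v, max(0.0, w + np.random.laplace(scale=1.0 / epsilon)))
                     for v, w in nbrs]
                 for u, nbrs in adj.items()}

        def dijkstra(src):
            dist = {src: 0.0}
            heap = [(0.0, src)]
            while heap:
                d, u = heapq.heappop(heap)
                if d > dist.get(u, float("inf")):
                    continue
                for v, w in noisy.get(u, []):
                    if d + w < dist.get(v, float("inf")):
                        dist[v] = d + w
                        heapq.heappush(heap, (d + w, v))
            return dist

        return {u: dijkstra(u) for u in noisy}

    graph = {0: [(1, 2.0), (2, 5.0)], 1: [(0, 2.0), (2, 1.0)], 2: [(0, 5.0), (1, 1.0)]}
    print(dp_apsp_baseline(graph, epsilon=1.0))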
Abstract
In this work, we study the task of estimating the numbers of distinct and k-occurring items in a time window under the constraint of differential privacy (DP). We consider several variants depending on whether the queries are on general time windows (between times t1 and t2), or are restricted to being cumulative (between times 1 and t2), and depending on whether the DP neighboring relation is event-level or the more stringent item-level. We obtain nearly tight upper and lower bounds on the errors of DP algorithms for these problems. En route, we obtain an event-level DP algorithm for estimating, at each time step, the number of distinct items seen over the last W updates with error polylogarithmic in W; this answers an open question of Bolot et al. (ICDT 2013).
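A simplified sketch of the sliding-window variant (not the paper's algorithm): under event-level DP, adding or removing one update changes the distinct count of any window by at most 1, so each individual release is ε-DP via the Laplace mechanism. Releasing at every step this way costs T·ε over T steps by basic composition, which is exactly the blowup a polylog(W)-error algorithm avoids; all names here are illustrative:

    from collections import deque
    import numpy as np

    def noisy_window_distinct_counts(stream, W, epsilon):
        # At each time step, report a Laplace-noised count of the distinct
        # items among the last W updates (sensitivity 1 under event-level DP).
        window = deque()
        releases = []
        for item in stream:
            window.append(item)
            if len(window) > W:
                window.popleft()
            releases.append(len(set(window)) +
                            np.random.laplace(scale=1.0 / epsilon))
        return releases

    print(noisy_window_distinct_counts([1, 2, 2, 3, 1, 4], W=3, epsilon=1.0))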
Abstract
In this work, we study the large-scale pretraining of BERT-Large (Devlin et al., 2018) with differentially private SGD (DP-SGD). We show that, combined with a careful implementation, scaling up the batch size to millions (i.e., mega-batches) improves the utility of the DP-SGD step for BERT; we also enhance the training efficiency by using an increasing batch size schedule. Our implementation builds on the recent work of Subramani et al., who demonstrated that the overhead of a DP-SGD step is minimized with effective use of JAX primitives in conjunction with the XLA compiler. Our implementation achieves a masked language model accuracy of 60.5% at a batch size of 2M, for ε = 5, which is a reasonable privacy setting. To put this number in perspective, non-private BERT models achieve an accuracy of ~70%.
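The privacy-critical part of a DP-SGD step is per-example gradient clipping followed by Gaussian noising of the averaged gradient. Below is a minimal NumPy sketch of that logic only; it is not the paper's implementation, which computes per-example gradients efficiently with JAX primitives and the XLA compiler, and all names and shapes are assumptions:

    import numpy as np

    def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr):
        # per_example_grads: (batch_size, num_params) array.
        batch_size = per_example_grads.shape[0]
        # Clip each example's gradient to L2 norm at most clip_norm.
        norms = np.linalg.norm(per_example_grads, axis=1)
        factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
        clipped = per_example_grads * factors[:, None]
        # Noise the average; std = noise_multiplier * clip_norm / batch_size,
        # so larger (mega-)batches shrink the effective noise per step.
        noise = np.random.normal(
            scale=noise_multiplier * clip_norm / batch_size, size=params.shape)
        return params - lr * (clipped.mean(axis=0) + noise)

    params = np.zeros(10)
    grads = np.random.randn(8, 10)  # batch of 8 per-example gradients
    print(dp_sgd_step(params, grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1))

The scaling in the noise term mirrors the abstract's point: at a fixed noise multiplier, averaging over millions of examples makes the injected noise per coordinate comparatively tiny.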
Abstract
The privacy loss distribution (PLD) provides a tight characterization of the privacy loss of a mechanism in the context of differential privacy (DP). Recent work has shown that PLD-based accounting allows for tighter (ε,δ)-DP guarantees for many popular mechanisms compared to other known methods. A key question in PLD-based accounting is how to approximate any (potentially continuous) PLD with a PLD over any specified discrete support.
We present a novel approach to this problem. Our approach supports both pessimistic estimation, which overestimates the hockey-stick divergence (i.e., δ) for any value of ε, and optimistic estimation, which underestimates it. Moreover, we show that our pessimistic estimate is the best possible among all pessimistic estimates. Experimental evaluation shows that our approach can use much larger discretization intervals than previous approaches while keeping a similar error bound, yielding a better approximation than existing methods.
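Given any discrete PLD, the (ε, δ)-DP guarantee follows from the standard identity δ(ε) = E[(1 − e^{ε−L})₊], where L is the privacy loss random variable. A minimal sketch of that conversion (the function name is an assumption), checked on binary randomized response:

    import numpy as np

    def delta_from_pld(losses, probs, epsilon):
        # Hockey-stick divergence of a discrete PLD: the privacy loss L takes
        # value losses[i] with probability probs[i], and
        # delta(eps) = sum_i probs[i] * max(0, 1 - exp(eps - losses[i])).
        losses = np.asarray(losses, dtype=float)
        probs = np.asarray(probs, dtype=float)
        return float(np.sum(probs * np.maximum(0.0, 1.0 - np.exp(epsilon - losses))))

    # Randomized response that answers truthfully w.p. 0.75: its privacy loss
    # is +/- ln(3) with probabilities 0.75 / 0.25.
    ln3 = np.log(3.0)
    print(delta_from_pld([ln3, -ln3], [0.75, 0.25], epsilon=0.0))  # 0.5
    print(delta_from_pld([ln3, -ln3], [0.75, 0.25], epsilon=ln3))  # 0.0

Pessimistic versus optimistic discretization, the subject of the abstract, concerns how a continuous PLD is rounded onto such a (losses, probs) grid so that this quantity is over- or under-estimated.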
Abstract
We study the problem of privately computing the anonymized histogram (a.k.a. unattributed histogram), which is defined as the histogram without item labels. Previous works have provided algorithms with ℓ1 and ℓ2 errors of O_ε(√n) in the central model of differential privacy (DP).
In this work, we provide an algorithm with a nearly matching error guarantee of Õ_ε(√n) in the shuffle and pan-private DP models. Our algorithm is very simple: it just post-processes the discrete Laplace-noised histogram! Using this algorithm as a subroutine, we show applications in estimating several symmetric properties of distributions, such as the entropy and support size.
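A hedged sketch of that recipe, shown in the central model for simplicity (the post-processing here, clipping at zero and sorting, is a simplification for illustration, not the paper's exact analysis): noise every count with discrete Laplace noise, sampled as a difference of two geometric variables, then return a valid anonymized histogram:

    import numpy as np

    def dp_anonymized_histogram(counts, epsilon):
        # Discrete Laplace noise DLap(alpha) with alpha = exp(-epsilon) is the
        # difference of two i.i.d. geometric(1 - alpha) variables; adding or
        # removing one item changes one count by 1, so the noised histogram
        # is epsilon-DP.
        alpha = np.exp(-epsilon)
        size = len(counts)
        noise = (np.random.geometric(1.0 - alpha, size) -
                 np.random.geometric(1.0 - alpha, size))
        noisy = np.asarray(counts) + noise
        # Post-process into a valid anonymized histogram: nonnegative counts,
        # sorted in decreasing order (labels carry no information anyway).
        return np.sort(np.maximum(noisy, 0))[::-1]

    print(dp_anonymized_histogram([10, 7, 7, 3, 1, 0, 0], epsilon=1.0))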