Nicolas Papernot
Authored Publications
Preview abstract
We study the problem of model extraction in natural language processing, where an adversary with query access to a victim model attempts to reconstruct a local copy of the model. We show that when both the adversary and victim model fine-tune existing pretrained models such as BERT, the adversary does not need to have access to any training data to mount the attack. Indeed, we show that randomly sampled sequences of words, which do not satisfy grammar structures, make effective queries to extract textual models. This is true even for complex tasks such as natural language inference or question answering.
Our attacks can be mounted with a modest query budget of less than $400. The extraction's accuracy can be further improved using a large textual corpus like Wikipedia, or with intuitive heuristics we introduce. Finally, we measure the effectiveness of two potential defense strategies: membership classification and API watermarking. While these defenses mitigate certain adversaries and come at a low overhead because they do not require re-training of the victim model, fully coping with model extraction remains an open problem.
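To make the random-query idea concrete, here is a minimal sketch that samples ungrammatical word sequences and records a victim's answers as training pairs for a local copy. The `toy_victim` function, the small vocabulary, and the helper names are hypothetical stand-ins for a remote API and are not taken from the paper.

```python
import random

def random_query(vocab, length, rng):
    """Sample `length` words uniformly at random; no grammar is enforced."""
    return " ".join(rng.choice(vocab) for _ in range(length))

def collect_extraction_data(victim_predict, vocab, n_queries, length, seed=0):
    """Query the victim with random word sequences and keep (query, label) pairs."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_queries):
        query = random_query(vocab, length, rng)
        dataset.append((query, victim_predict(query)))
    return dataset

# Hypothetical victim: a toy sentiment rule standing in for a remote model API.
def toy_victim(text):
    words = text.split()
    return "positive" if words.count("good") >= words.count("bad") else "negative"

VOCAB = ["good", "bad", "movie", "the", "plot", "was", "actor", "very"]
pairs = collect_extraction_data(toy_victim, VOCAB, n_queries=5, length=6)
```

The resulting pairs would then serve as the training set for the adversary's own fine-tuned model; that training step is omitted here.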
Improving Differentially Private Models with Active Learning
Zhengli Zhao
Sameer Singh
Neoklis Polyzotis
Augustus Odena
arXiv preprint arXiv:1910.01177 (2019)
Preview abstract
Broad adoption of machine learning techniques has increased privacy concerns for models trained on sensitive data such as medical records. Existing techniques for training differentially private (DP) models give rigorous privacy guarantees, but applying these techniques to neural networks can severely degrade model performance. This performance reduction is an obstacle to deploying private models in the real world. In this work, we improve the performance of DP models by fine-tuning them through active learning on public data. We introduce two new techniques - DIVERSEPUBLIC and NEARPRIVATE - for doing this fine-tuning in a privacy-aware way. For the MNIST and SVHN datasets, these techniques improve state-of-the-art accuracy for DP models while retaining privacy guarantees.
MixMatch: A Holistic Approach to Semi-Supervised Learning
David Berthelot
Ian Goodfellow
Avital Oliver
Colin Raffel
NeurIPS (2019) (to appear)
Preview abstract
Semi-supervised learning has proven to be a powerful paradigm for leveraging unlabeled data to mitigate the reliance on large labeled datasets. In this work, we unify the current dominant approaches for semi-supervised learning to produce a new algorithm called MixMatch. MixMatch works by guessing low-entropy labels for data-augmented unlabeled examples, and then mixes labeled and unlabeled data using MixUp. We show that MixMatch obtains state-of-the-art results by a large margin across many datasets and labeled data amounts. We also demonstrate how MixMatch can help achieve a dramatically better accuracy-privacy trade-off for differential privacy. Finally, we perform an ablation study to tease apart which components of MixMatch are most important for its success.
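The two ingredients named in this abstract, low-entropy label guessing and MixUp, can be sketched as follows. This is an illustrative sketch only: the function names, the temperature of 0.5, and the Beta parameter of 0.75 are assumptions chosen for the example, and the model, the augmentations, and the training loss are left out.

```python
import random

def sharpen(probs, temperature=0.5):
    """Lower the entropy of a distribution by raising it to 1/T and renormalizing."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def guess_label(predictions, temperature=0.5):
    """Average class predictions over several augmentations, then sharpen."""
    n_classes = len(predictions[0])
    mean = [sum(p[c] for p in predictions) / len(predictions)
            for c in range(n_classes)]
    return sharpen(mean, temperature)

def mixup(x1, y1, x2, y2, alpha=0.75, rng=random):
    """Convex combination of two examples; lam >= 0.5 keeps it closer to (x1, y1)."""
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Predictions of a hypothetical model on two augmentations of one unlabeled example.
preds = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
guessed = guess_label(preds)
mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

Sharpening pushes the averaged prediction toward its most likely class, which is what "low-entropy" refers to here.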
Preview abstract
We introduce the Soft Nearest Neighbor Loss that measures the entanglement of class manifolds in representation space: i.e., how close pairs of points from the same class are relative to pairs of points from different classes. We demonstrate several use cases of the loss. As an analytical tool, it provides insights into the evolution of similarity structures during learning. Surprisingly, we find that maximizing the entanglement of representations of different classes in the hidden layers is beneficial for discrimination, possibly because it encourages representations to identify class-independent similarity structures. Maximizing the soft nearest neighbor loss in the hidden layers leads not only to improved generalization but also to better-calibrated estimates of uncertainty on outlier data. Data that is not from the training distribution can be recognized by observing that in the hidden layers, it has fewer than the normal number of neighbors from the predicted class.
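One way to compute such an entanglement measure is sketched below, assuming squared Euclidean distances and a single temperature: for each point, compare the similarity mass of its same-class neighbors with the mass of all neighbors. The function name, the `eps` guard, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import math

def soft_nearest_neighbor_loss(points, labels, temperature=1.0, eps=1e-12):
    """Mean over points of -log(same-class neighbor mass / all-neighbor mass)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(points, labels)):
        same, everyone = 0.0, 0.0
        for j, (xj, yj) in enumerate(zip(points, labels)):
            if i == j:
                continue  # a point is not its own neighbor
            sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
            weight = math.exp(-sq_dist / temperature)
            everyone += weight
            if yi == yj:
                same += weight
        # eps guards against log(0) when a point has no close same-class neighbor.
        total += -math.log((same + eps) / (everyone + eps))
    return total / len(points)

points = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
separated = soft_nearest_neighbor_loss(points, [0, 0, 1, 1])
entangled = soft_nearest_neighbor_loss(points, [0, 1, 0, 1])
```

With these toy points, labels that follow the two clusters give a loss near zero, while labels that cut across the clusters give a much larger value, matching the reading of the loss as a measure of entanglement.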
Scalable Private Learning with PATE
Ilya Mironov
Ananth Raghunathan
Kunal Talwar
International Conference on Learning Representations (ICLR) (2018)
Preview abstract
The rapid adoption of machine learning has increased concerns about the privacy implications of machine learning models trained on sensitive data, such as medical records or other personal information. To address those concerns, one promising approach is Private Aggregation of Teacher Ensembles, or PATE, which transfers to a "student" model the knowledge of an ensemble of "teacher" models, with intuitive privacy provided by training teachers on disjoint data and strong privacy guaranteed by noisy aggregation of teachers’ answers. However, PATE has so far been evaluated only on simple classification tasks like MNIST, leaving unclear its utility when applied to larger-scale learning tasks and real-world datasets.
In this work, we show how PATE can scale to learning tasks with large numbers of output classes and uncurated, imbalanced training data with errors. For this, we introduce new noisy aggregation mechanisms for teacher ensembles that are more selective and add less noise, and prove their tighter differential-privacy guarantees. Our new mechanisms build on two insights: the chance of teacher consensus is increased by using more concentrated noise and, lacking consensus, no answer need be given to a student. The consensus answers used are more likely to be correct, offer better intuitive privacy, and incur lower differential-privacy cost. Our evaluation shows our mechanisms improve on the original PATE on all measures, and scale to larger tasks with both high utility and very strong privacy (ε < 1.0).
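The two insights above (concentrated noise and declining to answer without consensus) can be illustrated with a small sketch of a selective aggregator. It assumes Gaussian noise for both the consensus check and the final vote; the function name, the threshold, and the noise scales are hypothetical, and the privacy accounting that the paper proves is not reproduced here.

```python
import random
from collections import Counter

def noisy_aggregate(teacher_votes, n_classes, threshold,
                    sigma_check, sigma_answer, rng):
    """Return a noisy plurality label, or None when teachers lack consensus."""
    counts = Counter(teacher_votes)
    histogram = [counts.get(c, 0) for c in range(n_classes)]
    # Noisy consensus check: is the top vote count high enough to answer?
    if max(histogram) + rng.gauss(0.0, sigma_check) < threshold:
        return None  # no answer is given to the student
    # Noisy argmax over Gaussian-perturbed vote counts.
    noisy = [h + rng.gauss(0.0, sigma_answer) for h in histogram]
    return max(range(n_classes), key=lambda c: noisy[c])

rng = random.Random(0)
agree = [1] * 95 + [0] * 5               # strong consensus on class 1
split = [0] * 34 + [1] * 33 + [2] * 33   # no consensus
answer_agree = noisy_aggregate(agree, 3, 60, 1.0, 1.0, rng)
answer_split = noisy_aggregate(split, 3, 60, 1.0, 1.0, rng)
```

In this toy run the strongly agreeing ensemble yields a label, while the split ensemble yields `None`, so the student would simply skip that query.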
Adversarial Examples that Fool both Computer Vision and Time-Limited Humans
Gamaleldin Fathy Elsayed
Shreya Shankar
Brian Cheung
Ian Goodfellow
Jascha Sohl-Dickstein
NeurIPS (2018)
Preview abstract
Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.
Preview abstract
Adversarial examples are perturbed inputs designed to fool machine learning models. Adversarial training injects such examples into training data to increase robustness. To scale this technique to large datasets, perturbations are crafted using fast single-step methods that maximize a linear approximation of the model’s loss. We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss. The model thus learns to generate weak perturbations, rather than defend against strong ones. As a result, we find that adversarial training remains vulnerable to black-box attacks, where we transfer perturbations computed on undefended models, as well as to a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step. We further introduce Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models. We use ensemble adversarial training to train ImageNet models with strong robustness to black-box attacks. In particular, our most robust model won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks.
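The difference between a plain single-step attack and one that first takes a small random step can be sketched on a toy linear classifier. Everything here is an assumption made for illustration: the logistic loss, the weights, the step sizes `eps` and `alpha`, and the function names. The paper's experiments use deep networks rather than this toy model.

```python
import math
import random

def sign(v):
    return (v > 0) - (v < 0)

def loss_gradient(w, x, y):
    """Gradient w.r.t. x of the logistic loss log(1 + exp(-y * w.x)), y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]

def single_step(w, x, y, eps):
    """Move each coordinate by eps in the direction that increases the loss."""
    grad = loss_gradient(w, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

def random_then_single_step(w, x, y, eps, alpha, rng):
    """Random step of size alpha, then a gradient-sign step of size eps - alpha."""
    x_rand = [xi + alpha * sign(rng.gauss(0.0, 1.0)) for xi in x]
    grad = loss_gradient(w, x_rand, y)
    return [xi + (eps - alpha) * sign(g) for xi, g in zip(x_rand, grad)]

w, x, y = [1.0, -2.0], [0.5, 0.5], -1   # w.x = -0.5, so x is classified correctly
x_adv = single_step(w, x, y, eps=0.5)
x_adv_rand = random_then_single_step(w, x, y, eps=0.5, alpha=0.25,
                                     rng=random.Random(0))
score_adv = sum(wi * xi for wi, xi in zip(w, x_adv))
```

Both variants keep every coordinate within `eps` of the original input; the random step only changes the point at which the gradient is evaluated.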
A General Approach to Adding Differential Privacy to Iterative Training Procedures
Galen Andrew
Ilya Mironov
Steve Chien
NIPS (2018)
Preview abstract
In this work we address the practical challenges of training machine learning models on privacy-sensitive datasets by introducing a modular approach that minimizes changes to training algorithms, provides a variety of configuration strategies for the privacy mechanism, and then isolates and simplifies the critical logic that computes the final privacy guarantees. A key challenge is that training algorithms often require estimating many different quantities (vectors) from the same set of examples: for example, gradients of different layers in a deep learning architecture, as well as metrics and batch normalization parameters. Each of these may have different properties like dimensionality, magnitude, and tolerance to noise. By extending previous work on the Moments Accountant for the subsampled Gaussian mechanism, we can provide privacy for such heterogeneous sets of vectors, while also structuring the approach to minimize software engineering challenges.
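A rough sketch of the kind of step this abstract describes: each quantity estimated from the same examples is clipped to its own norm bound, summed, and perturbed with Gaussian noise scaled to that bound. The function names, the quantity names, and the clipping and noise settings are all hypothetical, and no privacy accounting is performed, so this shows only the mechanics and not the paper's library or its guarantees.

```python
import math
import random

def clip_to_norm(vec, max_norm):
    """Scale vec down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def private_mean(per_example_vectors, clip_norm, noise_multiplier, rng):
    """Clip each example's vector, sum, add Gaussian noise, and average."""
    dim = len(per_example_vectors[0])
    total = [0.0] * dim
    for vec in per_example_vectors:
        for k, v in enumerate(clip_to_norm(vec, clip_norm)):
            total[k] += v
    stddev = noise_multiplier * clip_norm  # noise scaled to the clipping bound
    n = len(per_example_vectors)
    return [(t + rng.gauss(0.0, stddev)) / n for t in total]

# Each named quantity gets its own clipping norm (hypothetical settings).
rng = random.Random(0)
quantities = {
    "layer1_grad": ([[3.0, 4.0], [0.3, 0.4]], 1.0),
    "layer2_grad": ([[10.0], [-10.0]], 5.0),
}
estimates = {name: private_mean(vecs, clip, 1.1, rng)
             for name, (vecs, clip) in quantities.items()}
noiseless = private_mean([[3.0, 4.0], [0.3, 0.4]], 1.0, 0.0, rng)
```

Giving each quantity its own bound reflects the point that vectors differ in dimensionality, magnitude, and tolerance to noise.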
Preview abstract
Adversarial examples are maliciously perturbed inputs designed to mislead machine learning (ML) models at test-time. They often transfer: the same adversarial example fools more than one model.
In this work, we propose novel methods for estimating the previously unknown dimensionality of the space of adversarial inputs. We find that adversarial examples span a contiguous subspace of large (~25) dimensionality. Adversarial subspaces with higher dimensionality are more likely to intersect. We find that for two different models, a significant fraction of their subspaces is shared, thus enabling transferability.
In the first quantitative analysis of the similarity of different models' decision boundaries, we show that these boundaries are actually close in arbitrary directions, whether adversarial or benign. We conclude by formally studying the limits of transferability. We derive (1) sufficient conditions on the data distribution that imply transferability for simple model classes and (2) examples of scenarios in which transfer does not occur. These findings indicate that it may be possible to design defenses against transfer-based attacks, even for models that are vulnerable to direct attacks.
Preview abstract
Machine learning classifiers are known to be vulnerable to inputs maliciously constructed by adversaries to force misclassification. Such adversarial examples have been extensively studied in the context of computer vision applications. In this work, we show adversarial attacks are also effective when targeting neural network policies in reinforcement learning. Specifically, we show existing adversarial example crafting techniques can be used to significantly degrade test-time performance of trained policies. Our threat model considers adversaries capable of introducing small perturbations to the raw input of the policy. We characterize the degree of vulnerability across tasks and training algorithms, for a subclass of adversarial-example attacks in white-box and black-box settings. Regardless of the learned task or training algorithm, we observe a significant drop in performance, even with small adversarial perturbations that do not interfere with human perception. Videos are available at this http URL