Ananth Balashankar

Authored Publications
Safety classifiers are used in critical settings such as toxicity detection. However, they are brittle, and their failure cases are unknown. Traditional adversarial data generation methods are rigid and often produce similar types of attacks. Enumerating the attack types and collecting corresponding examples of each is expensive and infeasible. To discover new types of attacks, we need automated methods for discovering adversarial types. Current attack generation methods rely either on simple perturbations, which are unlikely to generate naturally occurring data, or on language models, which are unlikely to generate data along unknown dimensions. To discover new types of attacks on safety classifiers, we introduce a discover-adapt framework that leverages large language models (LLMs) to iteratively identify different subtypes of toxicity (discover) and transform seed text to suit each subtype (adapt). Using adversarial success and dimensional diversity as evaluation metrics, we demonstrate that our method produces more of the desired data than existing approaches when generating identity attacks, insults, and sexually explicit content.
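A minimal sketch of the discover-adapt loop described in the abstract. The LLM interface (llm_generate), the prompt wording, and the safety classifier scorer (classifier_score) are hypothetical placeholders standing in for details the abstract does not give; this illustrates the iterative structure, not the paper's actual implementation.

```python
def discover_adapt(seed_texts, llm_generate, classifier_score,
                   n_rounds=5, threshold=0.5):
    """Iteratively discover toxicity subtypes and adapt seed texts to them.

    llm_generate: callable(prompt) -> str, an assumed LLM wrapper.
    classifier_score: callable(text) -> float in [0, 1], an assumed
        safety classifier; scores below `threshold` count as misses.
    """
    discovered = []    # subtypes found so far
    adversarial = []   # (subtype, text) pairs the classifier missed

    for _ in range(n_rounds):
        # Discover: ask the LLM for a toxicity subtype not yet covered.
        subtype = llm_generate(
            "Name one subtype of toxic language that is different from: "
            + ", ".join(discovered)
        )
        discovered.append(subtype)

        # Adapt: rewrite each seed so it expresses the new subtype.
        for seed in seed_texts:
            candidate = llm_generate(
                f"Rewrite the following text as a natural-sounding "
                f"instance of '{subtype}':\n{seed}"
            )
            # Adversarial success: the safety classifier fails to flag it.
            if classifier_score(candidate) < threshold:
                adversarial.append((subtype, candidate))

    return discovered, adversarial
```

Separating discovery from adaptation lets each round target a dimension the classifier has not yet been probed on, which is what drives the dimensional-diversity metric the abstract mentions.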
Developing robust NLP models that perform well on many, even small, slices of data is a difficult but important challenge, with implications from fairness to general reliability. To this end, recent research has explored how models rely on spurious correlations, and how counterfactual data augmentation (CDA) can mitigate such issues. In this paper we study how and why modeling counterfactuals over multiple attributes can go significantly further in improving model performance. We propose RDI, a context-aware methodology that takes into account the impact of secondary attributes on the model's predictions and increases sensitivity to secondary attributes via reweighted counterfactually augmented data. Applying RDI to toxicity detection, we find that accounting for secondary attributes can significantly improve robustness, with improvements in sliced accuracy on the original dataset of up to 7% over existing robustness methods. We also demonstrate that RDI generalizes to the coreference resolution task and provide guidelines for extending it to other tasks.
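A minimal sketch of reweighted multi-attribute counterfactual augmentation in the spirit of RDI. The attribute lexicons (PRIMARY_SWAPS, SECONDARY_SWAPS), the word-swap rule, and the weighting scheme are illustrative assumptions; the paper's actual attribute modeling and weights are not specified in the abstract.

```python
import re

# Hypothetical substitution tables: a primary (identity) attribute and a
# secondary (sentiment-bearing) attribute, both assumed for illustration.
PRIMARY_SWAPS = {"women": "men", "men": "women"}
SECONDARY_SWAPS = {"terrible": "wonderful", "wonderful": "terrible"}

def swap(text, table):
    """Produce a counterfactual by word-level substitution."""
    pattern = re.compile(r"\b(" + "|".join(table) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: table[m.group(0).lower()], text)

def augment(dataset, secondary_weight=2.0):
    """Return (text, label, weight) triples: each original example plus
    counterfactuals over the primary and secondary attributes, with the
    secondary counterfactuals upweighted so that small slices carry more
    influence during training."""
    out = []
    for text, label in dataset:
        out.append((text, label, 1.0))
        out.append((swap(text, PRIMARY_SWAPS), label, 1.0))
        # Upweight counterfactuals along the secondary attribute.
        out.append((swap(text, SECONDARY_SWAPS), label, secondary_weight))
    return out

data = [("The women in this video are terrible drivers", 1)]
for row in augment(data):
    print(row)
```

The design point this illustrates is that counterfactuals over a single primary attribute leave the model free to lean on correlated secondary attributes; generating and upweighting counterfactuals along those secondary attributes is what targets the small data slices the abstract describes.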