Adversarial Attacks in Multimodal Systems: A Practitioner's Survey

Shashank Kapoor
Ankit Shetgaonkar
Aman Raj
2025

Abstract

Multimodal models represent a significant advancement in Artificial Intelligence: a single model is trained to understand multiple unstructured modalities, namely text, image, video, and audio. Open-source variants of these models have made such breakthroughs even more accessible, and ML practitioners adopt, finetune, and deploy them in real-world applications. However, because the landscape of adversarial attacks spans each of these modalities, multimodal models inherit the vulnerabilities of every modality they support, amplifying the overall adversarial threat. While broad research exists on attacks within or across individual modalities, a practitioner-focused overview of attack types in the multimodal setting remains absent. This paper addresses that gap by surveying adversarial attacks targeting all four modalities: text, image, video, and audio. The survey maps the adversarial attack landscape and traces how multimodal adversarial threats have evolved. To the best of our knowledge, it is the first comprehensive summary of the threat landscape in the multimodal world.