Spider2.0-GUI: Can Multimodal Agents Achieve Expert Proficiency in Data Science and Engineering?
Abstract
The field of data science and engineering is crucial for harnessing large-scale data to assist both individuals and enterprises in analytical processing and automated orchestration. Despite this significance, large language model~(LLM)-based data agents remain underexplored, particularly with respect to professional data engineering tools such as {\tt dbt}, {\tt Airflow}, and {\tt Airbyte}, which are complex to use and involve intensive GUI operations. To bridge this gap, we introduce Spider2.0-GUI, the first benchmark focusing on enterprise data engineering software across the full data pipeline. It encapsulates $486$ tasks involving $20$ professional applications, spanning data warehousing, ingestion, transformation, analysis, visualization, and orchestration. Each task is paired with both abstract and verbose instructions to accommodate different levels of user expertise. We also build a comprehensive document warehouse of $11,231$ documents for Spider2.0-GUI to support retrieval-augmented agent frameworks. The benchmark is further equipped with a real-time, executable Ubuntu desktop environment that interacts with the real-world internet, providing a realistic and dynamic testing ground. Preliminary results with state-of-the-art vision language models~(VLMs) indicate that even the most advanced model achieves only an $11\%$ success rate~(SR) with abstract instructions and a $21\%$ SR with verbose instructions~(i.e., step-by-step tutorials). This benchmark not only probes the competencies of data agents, but also paves the way for future advances in automating real-world data science and engineering tasks.