ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training

Yuran Ding
Xinwei Chen
Zongwei Zhou
2025

Abstract

Optimizing large language model (LLM) training on distributed domain-specific accelerator systems presents significant challenges due to the complexity of the optimization space and a reliance on manual processes, resulting in slow development and underutilized resources. Existing optimization methods rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain. To address this, we introduce ASAP, an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. ASAP is a multi-agent system, featuring Coordinator, Analyzer, and Proposal agents, that integrates LLM reasoning with insights from performance profiling tools, analytical roofline models, and a knowledge base of best practices and successful past optimizations. The proposed design automates the diagnosis of performance bottlenecks and generates optimized sharding configurations with accompanying reasoning, effectively improving distributed LLM training efficiency. This approach promises to significantly reduce manual effort, shorten iteration cycles, and enhance accelerator utilization, offering a scalable and explainable methodology for AI-assisted performance engineering in large-scale machine learning.
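To make the Analyzer-to-Proposal pipeline concrete, the following is a minimal illustrative sketch, not the authors' implementation: an Analyzer step classifies a profiled kernel as compute- or memory-bound using a roofline model (arithmetic intensity versus the machine balance point), and a Proposal step maps that diagnosis to a sharding suggestion. All class names, thresholds, and suggestions here are hypothetical assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class KernelProfile:
    name: str
    flops: float          # floating-point operations executed by the kernel
    bytes_moved: float    # bytes read and written to device memory

def analyzer(p: KernelProfile, peak_flops: float, peak_bw: float) -> str:
    """Roofline diagnosis: compare arithmetic intensity to the machine balance."""
    intensity = p.flops / p.bytes_moved       # FLOPs per byte for this kernel
    machine_balance = peak_flops / peak_bw    # ridge point of the roofline
    return "compute-bound" if intensity >= machine_balance else "memory-bound"

def proposal(diagnosis: str) -> str:
    """Map a diagnosis to a (hypothetical) sharding-config suggestion."""
    if diagnosis == "memory-bound":
        return "increase tensor-parallel degree to reduce per-device memory traffic"
    return "increase micro-batch size to raise accelerator utilization"

def coordinator(profiles, peak_flops: float, peak_bw: float) -> dict:
    """Drive the Analyzer -> Proposal pipeline over all profiled kernels."""
    return {p.name: proposal(analyzer(p, peak_flops, peak_bw)) for p in profiles}

# Example: a high-intensity matmul versus a low-intensity softmax, on a
# hypothetical accelerator with 1 PFLOP/s peak compute and 1 TB/s bandwidth.
profiles = [
    KernelProfile("matmul", flops=1e12, bytes_moved=1e9),    # 1000 FLOPs/byte
    KernelProfile("softmax", flops=1e9, bytes_moved=1e9),    # 1 FLOP/byte
]
suggestions = coordinator(profiles, peak_flops=1e15, peak_bw=1e12)
```

In this sketch the coordinator labels `matmul` compute-bound (intensity 1000 meets the machine balance of 1000 FLOPs/byte) and `softmax` memory-bound, then emits a distinct suggestion for each; the real system would replace these heuristics with LLM reasoning grounded in the profiler output and knowledge base.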