ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training

Yuran Ding
Xinwei Chen
Zongwei Zhou
2025

Abstract

Optimizing large language model (LLM) training on distributed domain-specific accelerator systems presents significant challenges due to the complexity of the optimization space and a reliance on manual processes, resulting in slow development and underutilized resources. Existing optimization methods rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain. To address this, we introduce ASAP, an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. ASAP is a multi-agent system, featuring Coordinator, Analyzer, and Proposal agents, that integrates LLM reasoning with insights from performance profiling tools, analytical roofline models, and a knowledge base of best practices and successful past optimizations. The proposed design automates the diagnosis of performance bottlenecks and generates optimized sharding configurations with accompanying reasoning, effectively improving distributed LLM training efficiency. This approach promises to significantly reduce manual effort, shorten iteration cycles, and enhance accelerator utilization, offering a scalable and explainable methodology for AI-assisted performance engineering in large-scale machine learning.
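To make the Analyzer-to-Proposal pipeline concrete, the following is a minimal illustrative sketch, not the authors' implementation: an Analyzer step classifies a profiled kernel as compute- or memory-bound using a roofline model (arithmetic intensity versus the machine balance point), and a Proposal step maps that diagnosis to a sharding suggestion. All class names, thresholds, and suggestions here are hypothetical assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class KernelProfile:
    name: str
    flops: float          # floating-point operations executed by the kernel
    bytes_moved: float    # bytes read and written to device memory

def analyzer(p: KernelProfile, peak_flops: float, peak_bw: float) -> str:
    """Roofline diagnosis: compare arithmetic intensity to the machine balance."""
    intensity = p.flops / p.bytes_moved       # FLOPs per byte for this kernel
    machine_balance = peak_flops / peak_bw    # ridge point of the roofline
    return "compute-bound" if intensity >= machine_balance else "memory-bound"

def proposal(diagnosis: str) -> str:
    """Map a diagnosis to a (hypothetical) sharding-config suggestion."""
    if diagnosis == "memory-bound":
        return "increase tensor-parallel degree to reduce per-device memory traffic"
    return "increase micro-batch size to raise accelerator utilization"

def coordinator(profiles, peak_flops: float, peak_bw: float) -> dict:
    """Drive the Analyzer -> Proposal pipeline over all profiled kernels."""
    return {p.name: proposal(analyzer(p, peak_flops, peak_bw)) for p in profiles}

# Example: a high-intensity matmul versus a low-intensity softmax, on a
# hypothetical accelerator with 1 PFLOP/s peak compute and 1 TB/s bandwidth.
profiles = [
    KernelProfile("matmul", flops=1e12, bytes_moved=1e9),    # 1000 FLOPs/byte
    KernelProfile("softmax", flops=1e9, bytes_moved=1e9),    # 1 FLOP/byte
]
suggestions = coordinator(profiles, peak_flops=1e15, peak_bw=1e12)
```

In this sketch the coordinator labels `matmul` compute-bound (intensity 1000 meets the machine balance of 1000 FLOPs/byte) and `softmax` memory-bound, then emits a distinct suggestion for each; the real system would replace these heuristics with LLM reasoning grounded in the profiler output and knowledge base.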