Xinliang David Li
David is a Principal Engineer at Google. He manages the Compiler Optimization Team and leads the effort to generate the fastest possible code for Google's data center software. His research interests include highly scalable cross-module optimizations, scalable profile-guided optimizations (instrumentation, PMU sample, trace-based), memory hierarchy optimizations (data and instruction), micro-architecture optimizations, post-link optimizations, and ML-based optimization frameworks.
Authored Publications
PreFix: Optimizing the Performance of Heap-Intensive Applications
Chaitanya Mamatha Ananda
Rajiv Gupta
Han Shen
CGO 2025: International Symposium on Code Generation and Optimization, Las Vegas, NV, USA (to appear)
Abstract
Analyses of heap-intensive applications show that a small fraction of heap objects account for the majority of heap accesses and data cache misses. Prior works such as HDS and HALO have shown that allocating hot objects in separate memory regions can improve spatial locality, leading to better application performance. However, these techniques are constrained in two primary ways, limiting their gains. First, they achieve imperfect separation, polluting the hot memory region with many cold objects. Second, they cannot reorder objects across allocations because the original object allocation order is preserved. This paper presents a novel technique that achieves near-perfect separation of hot objects via a new context mechanism that efficiently identifies hot objects with high precision. This technique, named PreFix, is based upon Preallocating memory for a Fixed small number of hot objects. Guided by profiles, the program is instrumented to compute context information derived from dynamic object identifiers that precisely identifies hot object allocations, which are then placed at predetermined locations in the preallocated memory. The preallocated memory region for hot objects provides the flexibility to reorder objects across allocations and allows colocation of objects that are part of a hot data stream (HDS), improving spatial locality. The runtime overhead of identifying hot objects is not significant, as this optimization focuses only on a small number of static hot allocation sites and dynamic hot objects. While there is an increase in the program's memory footprint, it is manageable and can be controlled by limiting the size of the preallocated memory. In addition, PreFix incorporates an object-recycling optimization that reuses the same preallocated space to store different objects whose lifetimes are not expected to overlap. Our experiments with 13 heap-intensive applications yield reductions in execution times ranging from 2.77% to 74%. On average, PreFix reduces execution time by 21.7%, compared to 7.3% for HDS and 14% for HALO. This is due to PreFix's precision in hot object identification, hot object colocation, and low runtime overhead.
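As a rough illustration of the allocation scheme described above, the sketch below shows a preallocated hot-object region with fixed slots and a fallback to the regular allocator; the class name, slot layout, and recycling comment are illustrative assumptions, not PreFix's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Illustrative sketch only: a fixed-size, preallocated region for profile-
// identified hot objects. Each hot allocation (site plus dynamic object id)
// is mapped to a predetermined slot so that objects in the same hot data
// stream end up adjacent, independent of allocation order.
class HotObjectRegion {
 public:
  HotObjectRegion(size_t num_slots, size_t slot_bytes)
      : num_slots_(num_slots),
        slot_bytes_(slot_bytes),
        base_(static_cast<char*>(
            std::aligned_alloc(64, ((num_slots * slot_bytes + 63) / 64) * 64))) {}
  ~HotObjectRegion() { std::free(base_); }

  // Called at an allocation site the profile marked hot. `slot_index` would
  // be computed from the instrumented context (site id, dynamic object id).
  void* Allocate(uint32_t slot_index, size_t size) {
    if (slot_index < num_slots_ && size <= slot_bytes_) {
      return base_ + static_cast<size_t>(slot_index) * slot_bytes_;  // hot, colocated
    }
    return std::malloc(size);  // cold or oversized object: regular allocator
  }

  // Frees only pointers that did not come from the region; region slots are
  // reclaimed implicitly, and hot objects with non-overlapping lifetimes can
  // share a slot (the object-recycling optimization).
  void Deallocate(void* p) {
    char* c = static_cast<char*>(p);
    if (c < base_ || c >= base_ + num_slots_ * slot_bytes_) std::free(p);
  }

 private:
  size_t num_slots_;
  size_t slot_bytes_;
  char* base_;
};
```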
Abstract
Hashing is a fundamental operation in various computer science applications. Despite the prevalence of specific key formats like social security numbers, MAC addresses, plate numbers, and URLs, hashing libraries typically treat them as general byte sequences. This paper introduces a technique for synthesizing specialized hash functions tailored to particular byte formats. The proposed code generation method leverages three prevalent patterns: (i) fixed-length keys, (ii) keys with common subsequences, and (iii) keys ranging over predetermined sequences of bytes. The code generation process involves two algorithms: one identifies relevant regular expressions within key examples, and the other generates specialized hash functions based on these expressions. This approach, straightforward to implement, showcases improvements over highly optimized hash function implementations. Comparative analysis demonstrates that our synthetic functions outperform counterparts in the C++ Standard Template Library and the Google Abseil Library, achieving speedups ranging from 2% to 11%, depending on the key format.
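To make the idea concrete, here is a hedged sketch of what a format-specialized hash might look like for one hypothetical key format, a MAC-address-like string such as "3c:22:fb:09:5a:10"; the function name, constants, and mixing scheme are illustrative assumptions, not output of the paper's generator.

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

// Illustrative only: the key format has a fixed length (17 bytes) and fixed
// ':' separators at positions 2, 5, 8, 11, and 14, so a specialized hash can
// skip the constant bytes and mix just the 12 variable characters instead of
// running a general-purpose byte-stream hash over the whole key.
inline uint64_t HashMacLikeKey(std::string_view key) {
  // The caller is assumed to pass a well-formed 17-byte key.
  const char* p = key.data();
  char packed[12] = {p[0], p[1],  p[3],  p[4],  p[6],  p[7],
                     p[9], p[10], p[12], p[13], p[15], p[16]};
  uint64_t lo = 0, hi = 0;
  std::memcpy(&lo, packed, 8);      // first 8 variable bytes
  std::memcpy(&hi, packed + 8, 4);  // last 4 variable bytes
  // Cheap multiplicative mixing; a real synthesizer would choose constants
  // and structure by benchmarking against the target hash table.
  uint64_t h = lo * 0x9E3779B97F4A7C15ull;
  h ^= (hi + 0xBF58476D1CE4E5B9ull) * 0x94D049BB133111EBull;
  h ^= h >> 29;
  return h;
}
```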
Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications
Han Shen
Rahman Lavaee
ACM, pp. 617-631
Abstract
While profile guided optimizations (PGO) and link time optimizations (LTO) have been widely adopted, post link optimizations (PLO) have languished until recently, when researchers demonstrated that late injection of profiles can yield significant performance improvements. However, the disassembly-driven, monolithic design of post link optimizers faces scaling challenges with large binaries and is at odds with distributed build systems. To reconcile and enable post link optimizations within a distributed build environment, we propose Propeller, a relinking optimizer for warehouse-scale workloads. To enable flexible code layout optimizations, we introduce basic block sections, a novel linker abstraction. Propeller uses basic block sections to enable a new approach to PLO without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting the binary. The overhead of relinking is lowered by caching and leveraging distributed compiler actions during code generation. Propeller has been deployed to production at Google, with tens of millions of cores executing Propeller-optimized code at any time. An evaluation of internal warehouse-scale applications shows that Propeller improves performance by 1.1% to 8% beyond PGO and ThinLTO. Compiler tools such as Clang improve by 7%, while MySQL improves by 1%. Compared to the state-of-the-art binary optimizer, Propeller achieves comparable performance while lowering memory overheads by 30%-70% on large benchmarks.
Abstract
Estimating the probability with which a conditional branch instruction is taken is an important analysis that enables many optimizations in modern compilers. When using Profile Guided Optimizations (PGO), compilers are able to make a good estimation of the branch probabilities. In the absence of profile information, compilers resort to using heuristics for this purpose. In this work, we propose learning branch probabilities from a large corpus of data obtained from datacenter workloads.
Using metrics including Root Mean Squared Error, Mean Absolute Error, and cross-entropy, we show that the machine learning model improves branch probability estimation by 18-50% in comparison to compiler heuristics. This translates to a performance improvement of up to 8.1% on 24 out of a suite of 40 benchmarks, with a 1% geomean improvement on the suite. It also yields a greater than 1.2% performance improvement in an important search application.
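For reference, a sketch of the standard definitions of the metrics named above, comparing the model's predicted taken probability with the measured probability for each static branch; the notation below is ours, not the paper's.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{p}_i - p_i)^2},\qquad
\mathrm{MAE}  = \frac{1}{n}\sum_{i=1}^{n} \lvert \hat{p}_i - p_i \rvert,\qquad
\mathrm{CE}   = -\frac{1}{n}\sum_{i=1}^{n} \bigl( p_i \log \hat{p}_i + (1 - p_i)\log(1 - \hat{p}_i) \bigr)
```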
automemcpy: A framework for automatic generation of fundamental memory operations
Sam Xi
Proceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management (ISMM '21), June 22, 2021, Virtual, Canada (to appear)
Abstract
Memory manipulation primitives (memcpy, memset, memcmp) are used by virtually every application, from high performance computing to user interfaces. They often consume a significant portion of CPU cycles. Because they are so ubiquitous and critical, they are provided by language runtimes and in particular by libc, the C standard library. These implementations are heavily optimized, typically written in hand-tuned assembly for each target architecture.
In this article, we propose a principled alternative to hand-tuning these functions: (1) we profile the calls to these functions in their production environment and use this data to drive the important high-level algorithmic decisions, (2) we use a high-level language for the implementation, delegate the job of tuning the generated code to the compiler, and (3) we use constraint programming and automatic benchmarks to select the optimal high-level structure of the functions.
We compile our memory function implementations using the same compiler toolchain that we use for application code, which allows leveraging the compiler further through whole-program optimization. We have evaluated our approach by applying it to the fleet of one of the largest computing enterprises in the world. This work increased the performance of the fleet by 1%.
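As a rough sketch of point (2) above, writing the memory primitive in a high-level language and letting the compiler tune the code, the example below shows a size-dispatched copy whose thresholds would, in the real system, come from production size profiles and the constraint-programming step; the function and helper names are illustrative assumptions, not the paper's generated code.

```cpp
#include <cstddef>

// Illustrative sketch only. Small, fixed-size copies lower to a handful of
// register moves; the size regions and their thresholds are the high-level
// decisions the paper derives from production profiles and benchmarks.
template <size_t N>
static inline void CopyBlock(char* dst, const char* src) {
  __builtin_memcpy(dst, src, N);  // constant size: fully inlined by the compiler
}

// Copy `count` bytes (N <= count <= 2*N) as two possibly overlapping blocks.
template <size_t N>
static inline void CopyOverlapped(char* dst, const char* src, size_t count) {
  CopyBlock<N>(dst, src);
  CopyBlock<N>(dst + count - N, src + count - N);
}

void* CopySketch(void* dst_v, const void* src_v, size_t count) {
  char* dst = static_cast<char*>(dst_v);
  const char* src = static_cast<const char*>(src_v);
  if (count == 0) return dst_v;
  if (count == 1) { CopyBlock<1>(dst, src); return dst_v; }
  if (count <= 4)  { CopyOverlapped<2>(dst, src, count);  return dst_v; }
  if (count <= 8)  { CopyOverlapped<4>(dst, src, count);  return dst_v; }
  if (count <= 16) { CopyOverlapped<8>(dst, src, count);  return dst_v; }
  if (count <= 32) { CopyOverlapped<16>(dst, src, count); return dst_v; }
  // Large copies: a plain loop here; the deployed implementation would pick a
  // vectorized or rep;movsb strategy based on the automatic benchmarks.
  for (size_t i = 0; i < count; ++i) dst[i] = src[i];
  return dst_v;
}
```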
Abstract
Leveraging machine-learning (ML) techniques for compiler optimizations has been widely studied and explored in academia. However, the adoption of ML in general-purpose, industry-strength compilers has yet to happen. We propose MLGO, a framework for integrating ML techniques systematically in an industrial compiler -- LLVM. As a case study, we present the details and results of replacing the heuristics-based inlining-for-size optimization in LLVM with machine-learned models. To the best of our knowledge, this work is the first full integration of ML in a complex compiler pass in a real-world setting. It is available in the main LLVM repository. We use two different ML algorithms, Policy Gradient and Evolution Strategies, to train the inlining-for-size model, and achieve up to 7% size reduction compared to the state-of-the-art LLVM -Oz. The same model, trained on one corpus, generalizes well to a diversity of real-world targets, as well as to the same set of targets after months of active development. This property of the trained models is beneficial for deploying ML techniques in real-world settings.
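For a sense of the shape of such a replacement, below is a hedged, self-contained sketch in which the inline-for-size decision is a learned policy scored over call-site features; the struct fields, the linear scorer, and the weights are illustrative assumptions and do not reflect MLGO's actual LLVM interface or model.

```cpp
#include <array>
#include <cstdint>

// Illustrative features of a call site that a learned inlining policy might
// consume (names are hypothetical, not MLGO's feature set).
struct CallSiteFeatures {
  int64_t callee_instruction_count;
  int64_t caller_instruction_count;
  int64_t call_site_height;    // depth in the current inlining chain
  int64_t constant_arguments;  // arguments known to be constant
  int64_t edge_count;          // estimated call frequency
};

class InlineForSizePolicy {
 public:
  // In a deployed system the model would be a small network whose parameters
  // come from Policy Gradient or Evolution Strategies training; a linear
  // scorer keeps this sketch self-contained.
  bool ShouldInline(const CallSiteFeatures& f) const {
    double score = bias_;
    score += w_[0] * static_cast<double>(f.callee_instruction_count);
    score += w_[1] * static_cast<double>(f.caller_instruction_count);
    score += w_[2] * static_cast<double>(f.call_site_height);
    score += w_[3] * static_cast<double>(f.constant_arguments);
    score += w_[4] * static_cast<double>(f.edge_count);
    return score > 0.0;  // positive score: inlining is predicted to shrink code
  }

 private:
  // Placeholder weights; real values would be produced by RL training
  // against a corpus of builds.
  std::array<double, 5> w_{-0.8, -0.1, -0.3, 0.9, 0.2};
  double bias_ = 0.5;
};
```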
Abstract
Cross-Module Optimization (CMO) is an effective means for improving runtime performance, by extending the scope of optimizations across source module boundaries. Two CMO approaches are Link-Time Optimization (LTO) and Lightweight Inter-Procedural Optimization (LIPO). However, each of these solutions has limitations that prevent it from being enabled by default. ThinLTO is a new approach that attempts to address these limitations, with a goal of being enabled more broadly. ThinLTO aims to be as scalable as a regular non-LTO build, enabling CMO on large applications and machines without large memory configurations, while also integrating well with distributed and incremental build systems. This is achieved through fast purely summary-based Whole-Program Analysis (WPA), the only serial step, without reading or writing the program's Intermediate Representation (IR). Instead, CMO is applied during fully parallel optimization backends. This paper describes the motivation behind ThinLTO, its overall design, and current implementation in LLVM. Results from SPEC cpu2006 benchmarks and several large real-world applications illustrate that ThinLTO can scale as well as a non-LTO build while enabling most of the CMO performed with a full LTO build.
AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications
Dehao Chen
CGO 2016 Proceedings of the 2016 International Symposium on Code Generation and Optimization, ACM, New York, NY, USA, pp. 12-23
Abstract
AutoFDO is a system to simplify real-world deployment of feedback-directed optimization (FDO). The system works by sampling hardware performance monitors on production machines and using those profiles to guide optimization. Profile data is stale by design, and we have implemented compiler features to deliver stable speedup across releases. The resulting performance is a geomean of 10.5% improvement on our benchmarks. AutoFDO achieves 85% of the gains of traditional FDO, despite imprecision due to sampling and information lost in the compilation pipeline. The system is deployed to hundreds of binaries at Google, and it is extremely easy to enable; users need only to add some flags to their release build. To date, AutoFDO has increased the number of FDO users at Google by 8X and has doubled the number of cycles spent in FDO-optimized binaries. Over half of CPU cycles used are now spent in some flavor of FDO-optimized binaries.
Automated locality optimization based on the reuse distance of string operations
Silvius Rus
Raksit Ashok
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, IEEE Computer Society, Washington, DC, USA (2011), pp. 181-190
Lightweight Feedback-Directed Cross-Module Optimization
Raksit Ashok
Proceedings of International Symposium on Code Generation and Optimization (CGO), IEEE (2010)
Abstract
Cross-module inter-procedural compiler optimization (IPO) and Feedback-Directed Optimization (FDO) are two important compiler techniques delivering solid performance gains. The combination of IPO and FDO delivers peak performance, but also multiplies both techniques' usability problems. In this paper, we present LIPO, a novel static IPO framework that integrates IPO and FDO. Compared to existing approaches, LIPO no longer requires writing the compiler's intermediate representation, eliminates the link-time inter-procedural optimization phase entirely, and minimizes code re-generation overhead, thus improving scalability by an order of magnitude. Compared to an FDO baseline, and without further specific tuning, LIPO improves performance of SPEC2006 INT by 2.5% and of SPEC2000 INT by 4.4%, with up to 23% for one benchmark. We confirm our scalability results on a set of large industrial applications, demonstrating 2.9% performance improvements on average. Compile-time overhead for full builds is less than 30%, incremental builds take a few seconds on average, and storage requirements increase by only 24%, all compared to the FDO baseline.