Sriraman Tallam
Sriraman Tallam is a member of the LLVM compiler optimization team at Google in Sunnyvale, CA, where he works on compiler techniques to make Google applications more efficient. He obtained a PhD in Computer Science from the University of Arizona.
Authored Publications
DeduBB: Binary Code Size Reduction via Post-Link Basic Block De-duplication
Chaitanya Mamatha Ananda
Rajiv Gupta
Mahbod Afarin
Han Shen
LCTES (Languages, Compilers, Tools and Theory of Embedded Systems) (2026) (to appear)
Binary sizes of newer versions of software applications tend to be larger, primarily due to feature bloat. This poses various challenges, particularly for mobile applications: it affects upgrade rates, directly impacting revenue; it increases the cost of maintaining multiple versions; and it prevents some users from getting critical security fixes. Code bloat also poses a problem for large warehouse-scale applications, which experience performance degradation when their code size exceeds what smaller and more efficient code models can handle.
In this paper, we introduce a post-link optimization technique called DeduBB, which de-duplicates basic blocks of an application across procedure boundaries. While prior techniques used function outlining to de-duplicate redundant code sequences, they missed many opportunities because outlining cannot handle code that manipulates the program stack. In addition, previous techniques were either limited to the scope of a single module or lacked the scalable implementations required for large warehouse-scale applications. DeduBB handles all types of code duplication, as it uses a novel save-and-jump code pattern to execute de-duplicated code blocks. It is also designed to work in scalable post-link optimizers and can be applied even to large warehouse-scale datacenter applications. Finally, DeduBB is profile-guided and can be applied selectively to infrequently executed cold basic blocks so as not to affect application performance; in fact, in several cases the performance of the smaller binary improves due to reductions in its hot working-set size. We have implemented our technique in the state-of-the-art post-link optimizers BOLT and Propeller. Experiments show that we can reduce the code size of several benchmarks by 1.55% to 18.63%, on both Arm and x86, and on binaries that had already been heavily optimized for size using existing code-size-reduction features. Furthermore, aided by profiles, our technique retains more than 80% of the maximal code size savings without affecting performance.
PreFix: Optimizing the Performance of Heap-Intensive Applications
Chaitanya Mamatha Ananda
Rajiv Gupta
Han Shen
CGO 2025: International Symposium on Code Generation and Optimization, Las Vegas, NV, USA (to appear)
Analyses of heap-intensive applications show that a small fraction of heap objects account for the majority of heap accesses and data-cache misses. Prior works like HDS and HALO have shown that allocating hot objects in separate memory regions can improve spatial locality, leading to better application performance. However, these techniques are constrained in two primary ways that limit their gains. First, they achieve imperfect separation, polluting the hot memory region with many cold objects. Second, they cannot reorder objects across allocations, as the original object allocation order is preserved.
This paper presents a novel technique that achieves near-perfect separation of hot objects via a new context mechanism that efficiently identifies hot objects with high precision. The technique, named PreFix, is based upon Preallocating memory for a Fixed small number of hot objects. Guided by profiles, the program is instrumented to compute context information derived from dynamic object identifiers; this context precisely identifies hot object allocations, which are then placed at predetermined locations in the preallocated memory. The preallocated memory region for hot objects provides the flexibility to reorder objects across allocations and allows colocation of objects that are part of a hot data stream (HDS), improving spatial locality. The runtime overhead of identifying hot objects is small because the optimization focuses on a small number of static hot allocation sites and dynamic hot objects. While the program's memory footprint increases, the increase is manageable and can be controlled by limiting the size of the preallocated memory. In addition, PreFix incorporates an object-recycling optimization that reuses the same preallocated space for different objects whose lifetimes are not expected to overlap. Our experiments with 13 heap-intensive applications yield reductions in execution time ranging from 2.77% to 74%. On average, PreFix reduces execution time by 21.7%, compared to 7.3% for HDS and 14% for HALO, owing to its precision in hot-object identification, hot-object colocation, and low runtime overhead.
Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications
Han Shen
Krzysztof Pszeniczny
Rahman Lavaee
ACM, pp. 617-631
While profile-guided optimizations (PGO) and link-time optimizations (LTO) have been widely adopted, post-link optimizations (PLO) had languished until recently, when researchers demonstrated that late injection of profiles can yield significant performance improvements. However, the disassembly-driven, monolithic design of post-link optimizers faces scaling challenges with large binaries and is at odds with distributed build systems. To reconcile the two and enable post-link optimization within a distributed build environment, we propose Propeller, a relinking optimizer for warehouse-scale workloads. To enable flexible code layout optimizations, we introduce basic block sections, a novel linker abstraction. Propeller uses basic block sections to enable a new approach to PLO without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting the binary; the overhead of relinking is lowered by caching and by leveraging distributed compiler actions during code generation. Propeller has been deployed in production at Google, with tens of millions of cores executing Propeller-optimized code at any time. An evaluation on internal warehouse-scale applications shows that Propeller improves performance by 1.1% to 8% beyond PGO and ThinLTO: compiler tools such as Clang improve by 7%, while MySQL improves by 1%. Compared to the state-of-the-art binary optimizer, Propeller achieves comparable performance while lowering memory overheads by 30%-70% on large benchmarks.
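The basic block sections abstraction is exposed in upstream Clang/LLVM via `-fbasic-block-sections`. A sketch of the relink workflow, under the assumption of a perf-capable machine and the AutoFDO `create_llvm_prof` tool (file names here are placeholders):

```shell
# 1. Build with basic-block labels so hardware samples can be
#    mapped back to individual basic blocks.
clang++ -O2 -fbasic-block-sections=labels -o app app.cc

# 2. Collect a branch profile from a representative run.
perf record -e cycles:u -b -- ./app

# 3. Convert the samples into a basic-block layout for the linker
#    (tool and flags from the AutoFDO project; may vary by version).
create_llvm_prof --format=propeller --binary=app \
    --profile=perf.data --out=layout.txt

# 4. Relink (rather than rewrite) with the optimized block ordering.
clang++ -O2 -fbasic-block-sections=list=layout.txt -o app.opt app.cc
```

The key point of the design is step 4: because the profile is applied by relinking sections rather than by disassembling and rewriting the binary, the work fits naturally into cached, distributed build actions.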
Safe ICF: Pointer Safe and Unwinding Aware Identical Code Folding in Gold
Cary Coutant
Ian Lance Taylor
Chris Demetriou
GCC Developers Summit (2010)
We have found that large C++ applications and shared libraries tend to contain many functions whose code is identical to that of another function. As much as 10% of the code could theoretically be eliminated by merging such identical functions into a single copy. This optimization, Identical Code Folding (ICF), has been implemented in the gold linker: at link time, ICF detects functions with identical object code and merges them into a single copy. ICF can be unsafe, however, as it can change the run-time behaviour of code that relies on each function having a unique address. To address this, ICF can be used in a safe mode in which it identifies and folds only functions whose addresses are guaranteed not to have been used in comparison operations.
Further, profiling and debugging binaries with merged functions can be confusing, as the PC values of merged functions cannot always be disambiguated to point to the correct function. To address this, we propose a new call table format for the DWARF debugging information that allows tools like the debugger and the profiler to disambiguate PC values of merged functions correctly by examining the call chain.
Detailed experiments on the x86 platform show that ICF can reduce the text size of a selection of Google binaries, whose average text size is 64 MB, by about 6%. The code size savings with the safe option are almost as good as those obtained without it, and the run-time performance of the optimized binaries on the x86 platform does not change.
Dynamic Recognition of Synchronization Operations for Improved Data Race Detection
Chen Tian
Vijay Nagarajan
Rajiv Gupta
Proc. International Symposium on Software Testing and Analysis, ACM, Seattle (2008), pp. 143-154