Amin Vahdat

Amin Vahdat is an Engineering Fellow and Vice President for the Machine Learning, Systems, and Cloud AI team. The team is responsible for product and engineering across:
  • Compute (Google Compute Engine, Borg/Cluster Scheduling, Operating Systems and Kernel)
  • Platforms (TPUs, GPUs, Servers, Storage, and Networking)
  • Cloud AI and Core ML (Vertex AI, training, serving, compilers, frameworks)
  • Network Infrastructure (Datacenter, Campus, RPC, and End Host network software)

    Vahdat is active in Computer Science research, with more than 54,000 citations to over 200 refereed publications across cloud infrastructure, software defined networking, data consistency, operating systems, storage systems, data center architecture, and optical networking.

    In the past, he was the SAIC Professor of Computer Science and Engineering at UC San Diego. Vahdat received his PhD from UC Berkeley in Computer Science. Vahdat is an ACM Fellow and a member of the National Academy of Engineering. He has been recognized with a number of awards, including the National Science Foundation CAREER award, the Alfred P. Sloan Fellowship, the Duke University David and Janet Vaughn Teaching Award, the UC Berkeley Distinguished EECS Alumni Award, and the SIGCOMM Lifetime Achievement Award.

    Authored Publications
      We describe our experience with Fathom, a system for identifying the network performance bottlenecks of any service running in the Google fleet. Fathom passively samples RPCs, the principal unit of work for services. It segments the overall latency into host and network components with kernel and RPC stack instrumentation. It records these detailed latency metrics, along with detailed transport connection state, for every sampled RPC. This lets us determine if the completion is constrained by the client, network or server. To scale while enabling analysis, we also aggregate samples into distributions that retain multi-dimensional breakdowns. This provides us with a macroscopic view of individual services. Fathom runs globally in our datacenters for all production traffic, where it monitors billions of TCP connections 24x7. For five years Fathom has been our primary tool for troubleshooting service network issues and assessing network infrastructure changes. We present case studies to show how it has helped us improve our production services.
      Change Management in Physical Network Lifecycle Automation
      Virginia Beauregard
      Kevin Grant
      Angus Griffith
      Jahangir Hasan
      Chen Huang
      Quan Leng
      Jiayao Li
      Alexander Lin
      Zhoutao Liu
      Ahmed Mansy
      Bill Martinusen
      Nikil Mehta
      Andrew Narver
      Anshul Nigham
      Melanie Obenberger
      Sean Smith
      Kurt Steinkraus
      Sheng Sun
      Edward Thiele
      Proc. 2023 USENIX Annual Technical Conference (USENIX ATC 23)
      Automated management of a physical network's lifecycle is critical for large networks. At Google, we manage network design, construction, evolution, and management via multiple automated systems. In our experience, one of the primary challenges is to reliably and efficiently manage change in this domain: additions of new hardware and connectivity, planning and sequencing of topology mutations, introduction of new architectures, new software systems and fixes to old ones, etc. In particular, we have learned the importance of supporting multiple kinds of change in parallel without conflicts or mistakes (which cause outages) while also maintaining parallelism between different teams and between different processes. We now know that this requires automated support. This paper describes some of our network lifecycle goals, the automation we have developed to meet those goals, and the change-management challenges we encountered. We then discuss in detail our approaches to several specific kinds of change management: (1) managing conflicts between multiple operations on the same network; (2) managing conflicts between operations spanning the boundaries between networks; (3) managing representational changes in the models that drive our automated systems. These approaches combine both novel software systems and software-engineering practices. While this paper reports on our experience with large-scale datacenter network infrastructures, we are also applying the same tools and practices in several adjacent domains, such as the management of WAN systems, of machines, and of datacenter physical designs. Our approaches are likely to be useful at smaller scales, too.
      CAPA: An Architecture For Operating Cluster Networks With High Availability
      Bingzhe Liu
      Mukarram Tariq
      Omid Alipourfard
      Rich Alimi
      Deepak Arulkannan
      Virginia Beauregard
      Patrick Conner
      Brighten Godfrey
      Xander Lin
      Mayur Patel
      Joon Ong
      Amr Sabaa
      Alex Smirnov
      Manish Verma
      Prerepa Viswanadham
      Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 (2023)
      Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation” layer. We evaluate CAPA based on case studies of outages prevented, counterfactual analysis of past incidents, and known limitations. Management-plane-related outages have been substantially reduced in both frequency and severity, with an 82% reduction in cumulative duration of incidents normalized to fleet size over five years.
      Improving Network Availability with Protective ReRoute
      Abdul Kabbani
      Van Jacobson
      Jim Winget
      Brad Morrey
      Uma Parthavi Moravapalle
      Steven Knight
      SIGCOMM 2023
      We present PRR (Protective ReRoute), a transport technique for shortening user-visible outages that complements routing repair. It can be added to any transport to provide benefits in multipath networks. PRR responds to flow connectivity failure signals, e.g., retransmission timeouts, by changing the FlowLabel on packets of the flow, which causes switches and hosts to choose a different network path that may avoid the outage. To enable it, we shifted our IPv6 network architecture to use the FlowLabel, so that hosts can change the paths of their flows without application involvement. PRR is deployed fleetwide at Google for TCP and Pony Express, where it has been protecting all production traffic for several years. It is also available to our Cloud customers. We find it highly effective for real outages. In a measurement study on our network backbones, adding PRR reduced the cumulative region-pair outage time for RPC traffic by 63–84%. This is the equivalent of adding 0.4–0.8 "nines" of availability.
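The link between the outage-time reduction and the "nines" figure in the PRR abstract can be checked with a short calculation. The sketch below assumes the usual definition of nines as the negative base-10 logarithm of unavailability; under that assumption, cutting outage time by a fraction r scales unavailability by (1 - r) and adds -log10(1 - r) nines. The function name `added_nines` is hypothetical and is not taken from the paper.

```python
import math


def added_nines(outage_reduction: float) -> float:
    """Nines of availability gained when outage time shrinks by this fraction.

    Assumes nines = -log10(unavailability), so scaling unavailability by
    (1 - r) adds -log10(1 - r) nines.
    """
    if not 0.0 <= outage_reduction < 1.0:
        raise ValueError("outage_reduction must be in [0, 1)")
    return -math.log10(1.0 - outage_reduction)


# The two ends of the range quoted in the abstract.
for r in (0.63, 0.84):
    print(f"{r:.0%} less outage time -> {added_nines(r):.2f} nines")
# 63% less outage time -> 0.43 nines
# 84% less outage time -> 0.80 nines
```

Rounded to one decimal place, these values match the 0.4 and 0.8 quoted in the abstract.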
      Carbink: Fault-tolerant Far Memory
      Yang Zhou
      Sihang Liu
      Jiaqi Gao
      James Mickens
      Minlan Yu
      Hank Levy
      Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, Usenix (2022)
      Memory-intensive applications would benefit from using available memory from other machines (i.e., remote memory or far memory). However, there is a missing piece in recent far memory proposals -- cost-efficient fault tolerance for far memory. In this paper, we motivate the strong need for fault tolerance for far memory using machine/task failure statistics from a major internet service provider. Then we describe the design and implementation of a Fault-Tolerant application-integrated Far Memory (i.e., FTFM) framework. We compare several candidate fault tolerance schemes, and discuss their pros and cons. Finally, we test FTFM using several X-internal applications, including graph processing, globally-distributed database, and in-memory database. Our results show that FTFM has little impact on application performance (~x.x%), while achieving xx% performance of running applications purely in local memory.
      Traffic load balancing across multiple paths is a critical task for modern networks to reduce network congestion and improve network efficiency. Hashing, which is the foundation of traffic load balancing, still faces practical challenges. The key problem is a growing need for more hash functions, because networks are getting larger, with more switches, more stages, and increased path diversity. Meanwhile, topology and routing are becoming more agile in order to efficiently serve traffic demands with stricter throughput and latency SLAs. On the other hand, current-generation switch chips provide only a limited number of uncorrelated hash functions. We first demonstrate why the limited number of hash functions is a practical challenge in today's datacenter network (DCN) and wide-area network (WAN) designs. Then, to mitigate the problem, we propose a novel approach named color recombining, which enables hash function reuse by leveraging topology traits of multi-stage DCN networks. We also describe a novel framework based on coprime theory to mitigate hash correlation in generic mesh topologies (i.e., spineless DCN and WAN). Our evaluation on real network trace data and topologies demonstrates that we can reduce the extent of load imbalance (measured by coefficient of variation) by an order of magnitude.
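To illustrate why reusing a hash function across stages is a problem, and how the coefficient of variation captures the resulting imbalance, here is a minimal sketch. It does not implement the paper's color recombining or coprime-based framework. The two-stage topology, the fanout of 4, the synthetic flow ids, and the use of seeded SHA-256 as a stand-in for a switch hash function are all assumptions made for illustration.

```python
import hashlib
import statistics


def pick_next_hop(flow_id: int, seed: int, fanout: int) -> int:
    """Pick one of `fanout` next hops from a seeded hash of the flow id."""
    digest = hashlib.sha256(f"{seed}:{flow_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % fanout


def coefficient_of_variation(loads: list[int]) -> float:
    """Population standard deviation of link loads divided by their mean."""
    return statistics.pstdev(loads) / statistics.mean(loads)


def second_stage_loads(seed1: int, seed2: int,
                       flows: int = 10_000, fanout: int = 4) -> list[int]:
    """Per-link flow counts at one second-stage switch."""
    # Stage 1 spreads flows over `fanout` switches; keep those reaching switch 0.
    arriving = [f for f in range(flows) if pick_next_hop(f, seed1, fanout) == 0]
    loads = [0] * fanout
    for f in arriving:
        # Stage 2 hashes the same flow id again to pick an outgoing link.
        loads[pick_next_hop(f, seed2, fanout)] += 1
    return loads


reused = second_stage_loads(seed1=1, seed2=1)       # same hash at both stages
independent = second_stage_loads(seed1=1, seed2=2)  # different hash per stage
print("reused hash CV:     ", round(coefficient_of_variation(reused), 3))
print("independent hash CV:", round(coefficient_of_variation(independent), 3))
```

With the reused hash, every flow that reaches switch 0 hashes to link 0 again, so three of the four links stay idle and the coefficient of variation is sqrt(3). With an independent hash per stage, the flows spread roughly evenly and the value is close to zero.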
      Aquila: A unified, low-latency fabric for datacenter networks
      Hema Hariharan
      Eric Lance
      Moray Mclaren
      Stephen Wang
      Zhehua Wu
      Sunghwan Yoo
      Raghuraman Balasubramanian
      Prashant Chandra
      Michael Cutforth
      Peter James Cuy
      David Decotigny
      Rakesh Gautam
      Rick Roy
      Zuowei Shen
      Ming Tan
      Ye Tang
      Monica C Wong-Chan
      Joe Zbiciak
      Aquila: A unified, low-latency fabric for datacenter networks (2022)
      Datacenter workloads have evolved from the data-intensive, loosely-coupled workloads of the past decade to more tightly coupled ones, wherein ultra-low latency communication is essential for resource disaggregation over the network and to enable emerging programming models. We introduce Aquila, an experimental datacenter network fabric built with ultra-low latency support as a first-class design goal, while also supporting traditional datacenter traffic. Aquila uses a new Layer 2 cell-based protocol, GNet, an integrated switch, and a custom ASIC with low-latency Remote Memory Access (RMA) capabilities co-designed with GNet. We demonstrate that Aquila is able to achieve under 40 μs tail fabric Round Trip Time (RTT) for IP traffic and sub-10 μs RMA execution time across hundreds of host machines, even in the presence of background throughput-oriented IP traffic. This translates to more than 5x reduction in tail latency for a production-quality key-value store running on a prototype Aquila network.
      Understanding Host Interconnect Congestion
      Khaled Elmeleegy
      Masoud Moshref
      Rachit Agarwal
      Saksham Agarwal
      Sylvia Ratnasamy
      Association for Computing Machinery, New York, NY, USA (2022), 198–204
      We present evidence and characterization of host congestion in production clusters: the adoption of high-bandwidth access links has led to the emergence of bottlenecks within the host interconnect (NIC-to-CPU data path). We demonstrate that contention on existing IO memory management units and/or the memory subsystem can significantly reduce the available NIC-to-CPU bandwidth, resulting in hundreds of microseconds of queueing delays and eventual packet drops at hosts (even when running a state-of-the-art congestion control protocol that accounts for CPU-induced host congestion). We also discuss the implications of host interconnect congestion for the design of future host architectures, network stacks, and network protocols.
      A modern datacenter hosts thousands of services with a mix of latency-sensitive, throughput-intensive, and best-effort traffic with high degrees of fan-out and fan-in patterns. Maintaining low tail latency under high overload conditions is difficult, especially for latency-sensitive (LS) RPCs. In this paper, we consider the challenging case of providing service-level objectives (SLO) to LS RPCs when there are unpredictable surges in demand. We present Aequitas, a distributed sender-driven admission control scheme that is anchored on a key conceptual insight: Weighted-Fair Quality of Service (QoS) queues, found in standard NICs and switches, can be used to guarantee RPC-level latency SLOs by a judicious selection of QoS weights and traffic mix across QoS queues. Aequitas installs cluster-wide RPC latency SLOs by mapping LS RPCs to higher-weight QoS queues, and copes with overloads by adaptively apportioning LS RPCs amongst QoS queues based on measured completion times for each queue. When the network demand spikes unexpectedly to 25× the provisioned capacity, Aequitas achieves a latency SLO that is 3.8× lower than the state-of-the-art congestion control at the 99.9th percentile and admits 15× more RPCs meeting the SLO target compared to pFabric when RPC sizes are not aligned with priorities.
      Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking
      Joon Ong
      Arjun Singh
      Mukarram Tariq
      Rui Wang
      Jianan Zhang
      Virginia Beauregard
      Patrick Conner
      Rishi Kapoor
      Stephen Kratzer
      Nanfang Li
      Hong Liu
      Karthik Nagaraj
      Jason Ornstein
      Samir Sawhney
      Ryohei Urata
      Lorenzo Vicisano
      Kevin Yasumura
      Shidong Zhang
      Junlan Zhou
      Proceedings of ACM SIGCOMM 2022
      We present a decade of evolution and production experience with Jupiter datacenter network fabrics. In this period Jupiter has delivered 5x higher speed and capacity, 30% reduction in capex, 41% reduction in power, incremental deployment and technology refresh, all while serving live production traffic. A key enabler for these improvements is evolving Jupiter from a Clos to a direct-connect topology among the machine aggregation blocks. Critical architectural changes for this include: a datacenter interconnection layer employing Micro-ElectroMechanical Systems (MEMS) based Optical Circuit Switches (OCSes) to enable dynamic topology reconfiguration, centralized Software-Defined Networking (SDN) control for traffic engineering, and automated network operations for incremental capacity delivery and topology engineering. We show that the combination of traffic and topology engineering on direct-connect fabrics achieves throughput similar to that of Clos fabrics for our production traffic patterns. We also optimize for path lengths: 60% of the traffic takes a direct path from source to destination aggregation blocks, while the remainder transits one additional block, achieving an average block-level path length of 1.4 in our fleet today. OCS also achieves 3x faster fabric reconfiguration compared to pre-evolution Clos fabrics that used a patch-panel-based interconnect.
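The average path length quoted in the Jupiter abstract follows from the traffic split it reports. The sketch below assumes that a direct path counts as one block-level hop and a path through one transit block counts as two; that counting convention and the function name `average_block_path_length` are assumptions for illustration and are not stated in the abstract.

```python
def average_block_path_length(direct_fraction: float) -> float:
    """Weighted mean block-level path length.

    Assumes a direct path is 1 block-level hop and a path that transits
    one additional block is 2 hops.
    """
    if not 0.0 <= direct_fraction <= 1.0:
        raise ValueError("direct_fraction must be in [0, 1]")
    transit_fraction = 1.0 - direct_fraction
    return direct_fraction * 1 + transit_fraction * 2


# 60% direct and 40% through one transit block, as in the abstract.
print(round(average_block_path_length(0.60), 2))  # 1.4
```

Under this convention, 0.6 x 1 + 0.4 x 2 = 1.4, which agrees with the figure in the abstract.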