Authored Publications
While many network research papers address issues of deployability, with
a few exceptions, this has been limited to protocol compatibility or
switch-resource constraints, such as flow table sizes.
We argue that good network designs must also consider the costs and
complexities of deploying the design within the constraints of the physical
environment in a datacenter: physical deployability.
The traditional metrics of network "goodness" mostly do not account
for these costs and constraints, and this may partially explain why some
otherwise attractive designs have not been deployed in real-world datacenters.
View details
Data-driven Networking Research: models for academic collaboration with Industry (a Google point of view)
Computer Communication Review, 51:4 (2021), pp. 47-49
We (Google's networking teams) would like to increase our collaborations with academic researchers related to data-driven networking research.
There are some significant constraints on our ability to directly share data, and in case not everyone in the community understands these, this document provides a brief summary.
There are some models which can work (primarily, interns and visiting scientists).
We describe some specific areas where we would welcome proposals to work within those models.
View details
Autopilot: Workload Autoscaling at Google Scale
Paweł Findeisen
Jacek Świderski
Przemyslaw Broniek
Beata Strack
Piotr Witusowski
Proceedings of the Fifteenth European Conference on Computer Systems, Association for Computing Machinery (2020) (to appear)
In many public and private Cloud systems, users need to specify a limit for the amount of resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits might be throttled or killed, resulting in delaying or dropping end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource wastage.
To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack – the difference between the limit and the actual resource usage – while minimizing the risk that a task is killed with an out-of-memory (OOM) error or its performance degraded because of CPU throttling. Autopilot uses machine learning algorithms applied to historical data about prior executions of a job, plus a set of finely-tuned heuristics, to walk this line. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually-managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10.
Despite its advantages, ensuring that Autopilot was widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google's fleet-wide resource usage.
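As a rough illustration of the slack metric described in this abstract, the sketch below computes slack as a fraction of the limit. The abstract defines slack only as the difference between the limit and the actual resource usage; normalizing by the limit, and the function names `relative_slack` and `aggregate_slack`, are assumptions made here for illustration and are not Autopilot's implementation.

```python
from typing import Iterable, Tuple


def relative_slack(limit: float, usage: float) -> float:
    """Fraction of the limit left unused: (limit - usage) / limit.

    Normalizing by the limit is an assumption of this sketch.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (limit - usage) / limit


def aggregate_slack(tasks: Iterable[Tuple[float, float]]) -> float:
    """Slack over many (limit, usage) pairs, weighted by limit size."""
    tasks = list(tasks)
    total_limit = sum(limit for limit, _ in tasks)
    total_usage = sum(usage for _, usage in tasks)
    return relative_slack(total_limit, total_usage)


if __name__ == "__main__":
    # Two hypothetical tasks, each with a limit of 4 cores.
    print(aggregate_slack([(4.0, 3.0), (4.0, 1.0)]))
```

Under this reading, a lower value means limits track usage more tightly, which is the direction the abstract reports for Autopiloted jobs.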
View details
Borg: the Next Generation
Muhammad Tirmazi
Adam Barker
Md Ehtesam Haque
Zhijing Gene Qin
Mor Harchol-Balter
EuroSys'20, ACM, Heraklion, Crete (2020)
This paper analyzes a newly-published trace that covers 8
different Borg clusters for the month of May 2019. The
trace enables researchers to explore how scheduling works in
large-scale production compute clusters. We highlight how
Borg has evolved and perform a longitudinal comparison of
the newly-published 2019 trace against the 2011 trace, which
has been highly cited within the research community.
Our findings show that Borg features such as alloc sets
are used for resource-heavy workloads; automatic vertical
scaling is effective; job-dependencies account for much of
the high failure rates reported by prior studies; the workload
arrival rate has increased, as has the use of resource
over-commitment; the workload mix has changed, as jobs have
migrated from the free tier into the best-effort batch tier;
the workload exhibits an extremely heavy-tailed distribution
where the top 1% of jobs consume over 99% of resources; and
there is a great deal of variation between different clusters.
View details
Cloud customers want reliable, understandable promises from cloud providers that their applications will run reliably and with adequate performance, but today, providers offer only limited guarantees, which creates uncertainty for customers. Providers also must define internal metrics to allow them to operate their systems without violating customer promises or expectations. We explore why these guarantees are hard to define. We show that this problem shares some similarities with the challenges of applying statistics to make decisions based on sampled data. We also suggest that defining guarantees in terms of defense against threats, rather than guarantees for application-visible outcomes, can reduce the complexity of these problems. Overall, we offer a partial framework for thinking about Service Level Objectives (SLOs), and discuss some unsolved challenges.
View details
DieHard: reliable scheduling to survive correlated failures in cloud data centers
Mina Sedaghat
Eddie Wadbro
Sara De Luna
Oleg Seleznjev
Erik Elmroth
International Symposium on Cluster, Cloud and Grid Computing (CCGrid), IEEE/ACM, Cartagena, Colombia (2016), pp. 52-59
In large scale data centers, a single fault can lead to correlated failures of several physical machines and the tasks running on them, simultaneously. Such correlated failures can severely damage the reliability of a service or a job.
This paper models the impact of stochastic and correlated failures on job reliability in a data center. We focus on correlated failures caused by power outages or failures of network components, on jobs running multiple replicas of identical tasks. We present a statistical reliability model and an approximation technique for computing a job’s reliability in the presence of correlated failures.
In addition, we address the problem of scheduling a job with reliability constraints. We formulate the scheduling problem as an optimization problem, with the aim being to achieve the desired reliability with the minimum number of extra tasks. We present a scheduling algorithm that approximates the minimum number of required tasks and a placement to achieve a desired job reliability.
We study the efficiency of our algorithm using an analytical approach and by simulating a cluster with different failure sources and reliabilities. The results show that the algorithm can effectively approximate the minimum number of extra tasks required to achieve the job’s reliability.
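To make the idea of reliability under correlated failures concrete, here is a minimal sketch under a deliberately simplified model. It is not the statistical model or the scheduling algorithm from the paper. The assumptions are: each failure domain (for example, a power domain) fails independently with probability `q_domain` and takes down every task in it, each machine also fails independently with probability `p_machine`, and a job is reliable if at least `required` tasks survive. The round-robin search for the smallest task count is likewise a hypothetical stand-in.

```python
from math import comb
from typing import List, Tuple


def survivors_distribution(n_tasks: int, p_machine: float, q_domain: float) -> List[float]:
    """P(exactly s tasks survive) in one failure domain, for s = 0..n_tasks."""
    dist = [
        (1 - q_domain) * comb(n_tasks, s)
        * (1 - p_machine) ** s * p_machine ** (n_tasks - s)
        for s in range(n_tasks + 1)
    ]
    dist[0] += q_domain  # a domain-wide failure leaves no survivors
    return dist


def convolve(a: List[float], b: List[float]) -> List[float]:
    """Distribution of the sum of two independent survivor counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out


def job_reliability(placement: List[int], required: int,
                    p_machine: float, q_domain: float) -> float:
    """P(at least `required` tasks survive); placement[d] = tasks in domain d."""
    total = [1.0]
    for n_tasks in placement:
        total = convolve(total, survivors_distribution(n_tasks, p_machine, q_domain))
    return sum(total[required:])


def min_tasks_round_robin(required: int, n_domains: int, target: float,
                          p_machine: float, q_domain: float,
                          max_tasks: int = 1000) -> Tuple[int, List[int]]:
    """Smallest task count, spread round-robin over domains, that meets `target`."""
    for n in range(required, max_tasks + 1):
        placement = [n // n_domains + (1 if d < n % n_domains else 0)
                     for d in range(n_domains)]
        if job_reliability(placement, required, p_machine, q_domain) >= target:
            return n, placement
    raise ValueError("target reliability not reachable within max_tasks")


if __name__ == "__main__":
    print(job_reliability([2], 1, 0.0, 0.1))     # both replicas in one domain
    print(job_reliability([1, 1], 1, 0.0, 0.1))  # replicas spread over two domains
```

In this toy model, spreading two replicas across two domains gives higher reliability than co-locating them, which is the kind of effect that motivates accounting for correlated failures when placing tasks.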
View details
Service Level Objectives
Niall Murphy
Cody Smith
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Lessons learned from three container management systems over a decade.
View details
Large-scale cluster management at Google with Borg
Luis Pedrosa
Madhukar R. Korupolu
David Oppenheimer
Proceedings of the European Conference on Computer Systems (EuroSys), ACM, Bordeaux, France (2015)
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.
It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.
We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
View details
Long-term SLOs for reclaimed cloud computing resources
Marcus Carvalho
Francisco Brasileiro
ACM Symposium on Cloud Computing (SoCC), ACM, Seattle, WA, USA (2014), 20:1-20:13
The elasticity promised by cloud computing does not come for free. Providers need to reserve resources to allow users to scale on demand, and cope with workload variations, which results in low utilization. The current response to this low utilization is to re-sell unused resources with no Service Level Objectives (SLOs) for availability. In this paper, we show how to make some of these reclaimable resources more valuable by providing strong, long-term availability SLOs for them. These SLOs are based on forecasts of how many resources will remain unused during multi-month periods, so users can do capacity planning for their long-running services. By using confidence levels for the predictions, we give service providers control over the risk of violating the availability SLOs, and allow them to trade increased risk for more resources to make available. We evaluated our approach using 45 months of workload data from 6 production clusters at Google, and show that 6-17% of the resources can be re-offered with a long-term availability of 98.9% or better. A conservative analysis shows that doing so may increase the profitability of selling reclaimed resources by 22-60%.
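The trade-off between confidence level and offered capacity can be sketched with a simple empirical quantile. This is a swapped-in simplification, not the forecasting method from the paper: it assumes a list of past unused-capacity samples (`unused_history`, a hypothetical input) and returns the largest amount that was available in at least a `confidence` fraction of those samples.

```python
import math
from typing import Sequence


def offerable_capacity(unused_history: Sequence[float], confidence: float) -> float:
    """Largest capacity that was unused in at least `confidence` of the samples.

    Assumption of this sketch: past samples stand in for a forecast.
    """
    if not unused_history:
        raise ValueError("unused_history must not be empty")
    if not 0 < confidence <= 1:
        raise ValueError("confidence must be in (0, 1]")
    ordered = sorted(unused_history, reverse=True)
    # The k-th largest sample is met or exceeded by at least k samples.
    k = math.ceil(confidence * len(ordered))
    return ordered[k - 1]


if __name__ == "__main__":
    history = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # made-up units
    for level in (1.0, 0.5):
        print(level, offerable_capacity(history, level))
```

Lowering the confidence level raises the capacity that can be offered, at a higher risk of violating the availability SLO, which mirrors the trade-off the abstract describes.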
View details