Christoph Albrecht
Christoph Albrecht works as a Software Engineer in the System Level Software team for the TPU (Tensor Processing Unit) machine learning supercomputer, improving its availability, resiliency, and performance.
Previously, he worked at Waymo on motion planning for autonomous vehicles.
Before that, he worked in Google’s Network Architecture team on the simulation, scaling, and optimization of Google’s backbone networks. He also worked in Google’s Quantitative Infrastructure team, where he analyzed and optimized distributed storage systems. Prior to Google, he worked in electronic design automation at the Cadence Research Laboratories in Berkeley. He holds a Ph.D. in Mathematics from the Research Institute for Discrete Mathematics at the University of Bonn.
Authored Publications
Capacity planning for the Google backbone network
Ajay Kumar Bangla
Ben Preskill
Emilie Danna
Xiaoxue Zhao
ISMP 2015 (International Symposium on Mathematical Programming) (to appear)
Abstract
Google operates one of the largest backbone networks in the world. In this talk, we present optimization and simulation techniques we use to design the network topology and provision its capacity to achieve conflicting objectives such as scale, cost, availability, and latency.
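The capacity-provisioning problem mentioned in the abstract can be illustrated as a small linear program. The sketch below is a hypothetical, greatly simplified formulation rather than the production model from the talk: it provisions capacity on candidate links at minimum cost while routing fixed demands over candidate paths. The topology, demands, costs, and the SciPy-based solver are all assumptions for illustration.

```python
# Hypothetical sketch: provision edge capacities at minimum cost while
# routing fixed demands over candidate paths, as a linear program.
# Topology, paths, demands, and costs are invented for illustration.
import numpy as np
from scipy.optimize import linprog

edges = ["AB", "BC", "AC"]                # candidate links
cost = {"AB": 1.0, "BC": 1.0, "AC": 3.0}  # cost per unit of capacity
demands = {("A", "C"): 10.0}              # traffic matrix
paths = {("A", "C"): [["AC"], ["AB", "BC"]]}  # candidate routings

# Variables: one flow per (demand, path), then one capacity per edge.
flow_vars = [(d, tuple(p)) for d, ps in paths.items() for p in ps]
n_flow, n_edge = len(flow_vars), len(edges)

# Objective: capacity cost only; routing flows are free variables.
c = np.concatenate([np.zeros(n_flow), [cost[e] for e in edges]])

# Inequality: load on each edge minus its capacity <= 0.
A_ub = np.zeros((n_edge, n_flow + n_edge))
for i, e in enumerate(edges):
    for j, (_, p) in enumerate(flow_vars):
        if e in p:
            A_ub[i, j] = 1.0
    A_ub[i, n_flow + i] = -1.0
b_ub = np.zeros(n_edge)

# Equality: each demand must be fully routed across its paths.
A_eq = np.zeros((len(demands), n_flow + n_edge))
b_eq = np.zeros(len(demands))
for i, (d, vol) in enumerate(demands.items()):
    for j, (dj, _) in enumerate(flow_vars):
        if dj == d:
            A_eq[i, j] = 1.0
    b_eq[i] = vol

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
caps = dict(zip(edges, res.x[n_flow:]))
print("provisioned capacity:", caps)  # the cheaper A->B->C route wins
```

A real formulation would add availability (surviving failure scenarios) and latency terms, which is where the conflicting objectives in the abstract arise.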
Janus: Optimal Flash Provisioning for Cloud Storage Workloads
Murray Stokely
Muhammad Waliji
Francois Labelle
Xudong Shi
Eric Schrock
Proceedings of the USENIX Annual Technical Conference, USENIX Association (2013), pp. 91-102
Abstract
Janus is a system for partitioning the flash storage tier between workloads in a cloud-scale distributed file system with two tiers, flash storage and disk. The file system stores newly created files in the flash tier and moves them to the disk tier using either a First-In-First-Out (FIFO) policy or a Least-Recently-Used (LRU) policy, subject to per-workload allocations. Janus constructs compact metrics of the cacheability of the different workloads, using sampled distributed traces because of the large scale of the system. From these metrics, we formulate and solve an optimization problem to determine the flash allocation to workloads that maximizes the total reads sent to the flash tier, subject to operator-set priorities and bounds on flash write rates. Using measurements from production workloads in multiple data centers that adopted these recommendations, as well as traces of other production workloads, we show that the resulting allocation improves the flash hit rate by 47–76% compared to a unified tier shared by all workloads. Based on these results and an analysis of several thousand production workloads, we conclude that flash storage is a cost-effective complement to disks in data centers.
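The allocation step lends itself to a concrete sketch. Below is a hypothetical Python illustration, not the Janus implementation: if each workload's cacheability metric is summarized as a concave reads-versus-flash curve, greedily allocating flash in fixed increments to the workload with the largest marginal read gain maximizes total reads served from flash. Workload names, curves, and the step size are invented.

```python
# Hypothetical sketch: greedy flash allocation by marginal read rate.
# Assumes each workload's cacheability curve (reads/sec served from
# flash as a function of flash GiB) is concave, so greedy marginal
# allocation is optimal. All numbers are invented for illustration.
import heapq
import math

def allocate_flash(curves, total_flash_gib, step_gib=100):
    """curves: {workload: f(flash_gib) -> reads/sec served from flash}."""
    alloc = {w: 0 for w in curves}
    # Max-heap keyed on the marginal reads/sec of the next step_gib.
    heap = [(-(f(step_gib) - f(0)), w) for w, f in curves.items()]
    heapq.heapify(heap)
    remaining = total_flash_gib
    while remaining >= step_gib and heap:
        neg_gain, w = heapq.heappop(heap)
        if -neg_gain <= 0:
            break  # no workload benefits from more flash
        alloc[w] += step_gib
        remaining -= step_gib
        f = curves[w]
        nxt = f(alloc[w] + step_gib) - f(alloc[w])
        heapq.heappush(heap, (-nxt, w))
    return alloc

# Invented concave cacheability curves for two workloads.
curves = {
    "logs":   lambda gib: 500 * math.log1p(gib / 200),
    "photos": lambda gib: 900 * math.log1p(gib / 800),
}
print(allocate_flash(curves, total_flash_gib=2000))
```

The paper's actual formulation additionally respects operator-set priorities and per-workload flash write-rate bounds, which this sketch omits.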
Projecting Disk Usage Based on Historical Trends in a Cloud Environment
Murray Stokely
Amaan Mehrabian
Francois Labelle
ScienceCloud 2012: Proceedings of the 3rd International Workshop on Scientific Cloud Computing, ACM, pp. 63-70
Abstract
Provisioning scarce resources among competing users and jobs remains one of the primary challenges of operating large-scale, distributed computing environments. Distributed storage systems, in particular, typically rely on hard operator-set quotas to control disk allocation and enforce isolation for space and I/O bandwidth among disparate users. However, users and operators are very poor at predicting future requirements and, as a result, tend to grossly over-provision.

For three years, we collected detailed usage information for data stored in distributed filesystems in a large private cloud spanning dozens of clusters on multiple continents. Specifically, we measured the disk space usage, I/O rate, and age of stored data for thousands of different engineering users and teams. We find that although the individual time series often have non-stable usage trends, regional aggregations, user classification, and ensemble forecasting methods can be combined to provide a more accurate prediction of future use for the majority of users.

We applied this methodology to the storage users in one geographic region and back-tested these techniques over the past three years to compare our forecasts against actual usage. We find that by classifying a small subset of users with unforecastable trend changes due to known product launches, we can generate three-month-out forecasts with mean absolute errors of less than 12%. This compares favorably to the amount of allocated but unused quota that is generally wasted with manual operator-set quotas.
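As a toy illustration of the ensemble back-testing idea, the hypothetical Python sketch below averages two simple component forecasts, a fitted linear trend and simple exponential smoothing, over a synthetic monthly usage series and scores a three-month hold-out. The series, the component models, and the error metric are invented and far simpler than the paper's methodology.

```python
# Hypothetical sketch: ensemble forecast of aggregated disk usage with a
# back-test on a held-out horizon. Data and models are invented; the
# paper combines regional aggregation, user classification, and larger
# ensembles of forecasting models.
import numpy as np

def linear_trend_forecast(y, horizon):
    """Extrapolate an OLS line fitted to the history."""
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)
    future = np.arange(len(y), len(y) + horizon)
    return intercept + slope * future

def ses_forecast(y, horizon, alpha=0.5):
    """Simple exponential smoothing: flat forecast at the last level."""
    level = y[0]
    for v in y[1:]:
        level = alpha * v + (1 - alpha) * level
    return np.full(horizon, level)

def ensemble_forecast(y, horizon):
    """Average the component forecasts."""
    return (linear_trend_forecast(y, horizon) + ses_forecast(y, horizon)) / 2

# Back-test: hold out the last 3 months of a synthetic usage series.
rng = np.random.default_rng(0)
usage = 100 + 2.0 * np.arange(36) + rng.normal(0, 3, 36)  # monthly PiB
history, actual = usage[:-3], usage[-3:]
pred = ensemble_forecast(history, horizon=3)
mape = np.mean(np.abs(pred - actual) / actual)
print(f"3-month-out mean absolute percentage error: {mape:.1%}")
```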