Christoph Albrecht

Christoph Albrecht works as a Software Engineer in the System Level Software team for the TPU (Tensor Processing Unit) Machine Learning Supercomputer, improving its availability, resiliency, and performance. Previously, he worked at Waymo on motion planning for autonomous vehicles. Before that, he worked in Google’s Network Architecture team on the simulation, scaling, and optimization of Google’s backbone networks. He also worked in the Quantitative Infrastructure team at Google, where he analyzed and optimized distributed storage systems. Prior to Google, he worked in electronic design automation at the Cadence Research Laboratories at Berkeley. He holds a Ph.D. in Mathematics from the Research Institute for Discrete Mathematics at the University of Bonn.
Authored Publications
    Capacity planning for the Google backbone network
    Ajay Kumar Bangla
    Ben Preskill
    Emilie Danna
    Xiaoxue Zhao
    ISMP 2015 (International Symposium on Mathematical Programming) (to appear)
    Abstract: Google operates one of the largest backbone networks in the world. In this talk, we present optimization and simulation techniques we use to design the network topology and provision its capacity to achieve conflicting objectives such as scale, cost, availability, and latency.
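    The talk's formulations are not spelled out in this abstract, so the following toy linear program (Python/SciPy) only illustrates the kind of trade-off involved: it provisions capacity on two hypothetical paths between a pair of sites, balancing cost against latency while keeping enough capacity on each path to survive a single failure. The paths, prices, latency weight, and availability rule are illustrative assumptions, not the models used in the talk.

      # Toy capacity-provisioning LP; all numbers and constraints are assumed for
      # illustration and are not taken from the talk.
      import numpy as np
      from scipy.optimize import linprog

      demand_gbps = 100.0
      # Candidate paths: (cost per Gbps, latency in ms) -- hypothetical values.
      paths = {"direct": (5.0, 10.0), "long_haul": (2.0, 30.0)}
      latency_weight = 0.05  # knob trading off cost against latency (assumed)

      # Objective: minimize provisioning cost plus a latency penalty per Gbps.
      c = np.array([cost + latency_weight * lat for cost, lat in paths.values()])

      # Constraints in the form A_ub @ x <= b_ub:
      #   1) total capacity covers the demand
      #   2) each path covers at least half the demand, so a single path
      #      failure still leaves 50% of the demand served (availability proxy)
      A_ub = np.array([
          [-1.0, -1.0],
          [-1.0,  0.0],
          [ 0.0, -1.0],
      ])
      b_ub = np.array([-demand_gbps, -demand_gbps / 2, -demand_gbps / 2])

      res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2, method="highs")
      for name, gbps in zip(paths, res.x):
          print(f"{name}: provision {gbps:.0f} Gbps")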
    Janus: Optimal Flash Provisioning for Cloud Storage Workloads
    Murray Stokely
    Muhammad Waliji
    Francois Labelle
    Xudong Shi
    Eric Schrock
    Proceedings of the USENIX Annual Technical Conference, USENIX, Advanced Computing System Association, 2560 Ninth Street, Suite 215, Berkeley, CA 94710, USA (2013), pp. 91-102
    Abstract: Janus is a system for partitioning the flash storage tier between workloads in a cloud-scale distributed file system with two tiers, flash storage and disk. The file system stores newly created files in the flash tier and moves them to the disk tier using either a First-In-First-Out (FIFO) policy or a Least-Recently-Used (LRU) policy, subject to per-workload allocations. Janus constructs compact metrics of the cacheability of the different workloads, using sampled distributed traces because of the large scale of the system. From these metrics, we formulate and solve an optimization problem to determine the flash allocation to workloads that maximizes the total reads sent to the flash tier, subject to operator-set priorities and bounds on flash write rates. Using measurements from production workloads in multiple data centers using these recommendations, as well as traces of other production workloads, we show that the resulting allocation improves the flash hit rate by 47–76% compared to a unified tier shared by all workloads. Based on these results and an analysis of several thousand production workloads, we conclude that flash storage is a cost-effective complement to disks in data centers.
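    As a rough illustration of the allocation step described above, the sketch below greedily hands out a flash budget across workloads by marginal read benefit, which yields the optimal split when each workload's reads-vs-flash curve is concave. The workload names, curves, and budget are invented, and the paper's priority and write-rate constraints are omitted.

      # Simplified flash-allocation sketch; the curves and budget are made up and
      # this is not the paper's actual formulation (priorities and write-rate
      # bounds are ignored).
      FLASH_BUDGET_TB = 100

      # marginal_reads[w][i] = extra reads/s gained by giving workload w its
      # (i+1)-th terabyte of flash (diminishing returns, i.e. concave totals).
      marginal_reads = {
          "logs":      [50 * 0.95**i for i in range(200)],
          "analytics": [80 * 0.90**i for i in range(200)],
          "serving":   [120 * 0.85**i for i in range(200)],
      }

      allocation = {w: 0 for w in marginal_reads}
      for _ in range(FLASH_BUDGET_TB):
          # Give the next terabyte to the workload that gains the most from it.
          best = max(marginal_reads, key=lambda w: marginal_reads[w][allocation[w]])
          allocation[best] += 1

      print(allocation)  # terabytes of flash assigned to each workload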
    Projecting Disk Usage Based on Historical Trends in a Cloud Environment
    Murray Stokely
    Amaan Mehrabian
    Francois Labelle
    ScienceCloud 2012 Proceedings of the 3rd International Workshop on Scientific Cloud Computing, ACM, pp. 63-70
    Abstract: Provisioning scarce resources among competing users and jobs remains one of the primary challenges of operating large-scale, distributed computing environments. Distributed storage systems, in particular, typically rely on hard operator-set quotas to control disk allocation and enforce isolation for space and I/O bandwidth among disparate users. However, users and operators are very poor at predicting future requirements and, as a result, tend to over-provision grossly. For three years, we collected detailed usage information for data stored in distributed filesystems in a large private cloud spanning dozens of clusters on multiple continents. Specifically, we measured the disk space usage, I/O rate, and age of stored data for thousands of different engineering users and teams. We find that although the individual timeseries often have non-stable usage trends, regional aggregations, user classification, and ensemble forecasting methods can be combined to provide a more accurate prediction of future use for the majority of users. We applied this methodology for the storage users in one geographic region and back-tested these techniques over the past three years to compare our forecasts against actual usage. We find that by classifying a small subset of users with unforecastable trend changes due to known product launches, we can generate three-month-out forecasts with mean absolute errors of less than ~12%. This compares favorably to the amount of allocated but unused quota that is generally wasted with manual operator-set quotas.
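    To make the forecasting pipeline concrete, here is a heavily simplified sketch: it aggregates synthetic per-user usage into a regional series, averages two basic forecasters (a fitted linear trend and the last observed value) into an ensemble, and backtests the three-month-ahead error. The synthetic data, the choice of base forecasters, and the error normalization are assumptions for illustration; the paper's methodology (user classification, richer ensembles) is more involved.

      # Simplified disk-usage forecasting sketch with synthetic data; not the
      # paper's actual models.
      import numpy as np

      rng = np.random.default_rng(0)
      months = 36
      # Synthetic per-user usage (TB): roughly linear growth plus noise.
      users = [np.cumsum(rng.normal(loc=g, scale=2.0, size=months)) + 50
               for g in rng.uniform(0.5, 3.0, size=200)]
      regional = np.sum(users, axis=0)  # aggregation smooths individual volatility

      def trend_forecast(series, horizon):
          t = np.arange(len(series))
          slope, intercept = np.polyfit(t, series, 1)
          return intercept + slope * (len(series) - 1 + horizon)

      def naive_forecast(series, horizon):
          return series[-1]

      def ensemble_forecast(series, horizon):
          return 0.5 * trend_forecast(series, horizon) + 0.5 * naive_forecast(series, horizon)

      # Backtest: at each month t, forecast 3 months ahead and compare to actuals.
      horizon = 3
      errors = [abs(ensemble_forecast(regional[:t], horizon) - regional[t + horizon - 1])
                for t in range(12, months - horizon)]
      print("mean absolute error: %.1f%% of current usage"
            % (100 * np.mean(errors) / regional[-1]))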