Arif Merchant

Arif Merchant

Arif Merchant is a Research Scientist with the Storage Analytics group at Google, where he studies interactions between components of the storage stack. Prior to this, he was with HP Labs, where he worked on storage QoS, distributed storage systems, and stochastic models of storage. He holds the B.Tech. degree from IIT Bombay and the Ph.D. in Computer Science from Stanford University. He is an ACM Distinguished Scientist.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
    Shashwat Silas
    Dave Clausen
    File and Storage Technologies (FAST), USENIX (2023)
    Preview abstract Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field, because they can balance many of these requirements. In our work we study wide LRCs; LRCs with large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures. We conduct a practically-minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations, and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers. View details
    CacheSack: Admission Optimization for Datacenter Flash Caches
    Homer Wolfmeister
    Mustafa Uysal
    Seth Bradley Pollen
    Tzu-Wei Yang
    2022 USENIX Annual Technical Conference (2022) (to appear)
    Preview abstract This paper describes the algorithm, implementation, and deployment experience of CacheSack, the admission algorithm for Google datacenter flash caches. CacheSack minimizes the dominant costs of Google’s datacenter flash caches: disk IO and flash footprint. CacheSack partitions cache traffic into disjoint categories, analyzes the observed cache benefit of each subset, and formulates a knapsack problem to assign the optimal admission policy to each subset. Prior to this work, Google datacenter flash cache admission policies were optimized manually, with most caches using the Lazy Adaptive Replacement Cache (LARC) algorithm. Production experiments showed that CacheSack significantly outperforms the prior static admission policies for a 6.5% improvement of the total operational cost, as well as significant improvements in disk reads and flash wearout. View details
    Tiger: disk-adaptive redundancy without placement restrictions
    Francisco Maturana
    Sanjith Athlur
    Rashmi KV
    Gregory R. Ganger
    Tiger: disk-adaptive redundancy without placement restrictions (2022)
    Preview abstract Large-scale cluster storage systems use redundancy (via erasure coding) to ensure data durability. Disk-adaptive redundancy—dynamically tailoring the redundancy scheme to observed disk failure rates—promises significant space and cost savings. Existing disk-adaptive redundancy systems, however, pose undesirable constraints on data placement, partitioning disks into subclusters with homogeneous failure rates and forcing each erasure-coded stripe to be entirely placed on the disks within one subcluster. This design increases risk, by reducing intra-stripe diversity and being more susceptible to unanticipated changes in a make/model’s failure rate, and only works for very large storage clusters fully committed to disk-adaptive redundancy. Tiger is a new disk-adaptive redundancy system that efficiently avoids adoption-blocking placement constraints, while also providing higher space-savings and lower risk relative to prior designs. To do so, Tiger introduces the eclectic stripe, in which disks with different failure rates can be used to store a stripe that has redundancy tailored to the set of failure rates of those disks. With eclectic stripes, pre-existing placement policies can be used while still enjoying the space-savings and robustness benefits of disk-adaptive redundancy. This paper introduces eclectic striping and Tiger’s design, including a new mean time-to-data-loss (MTTDL) approximation technique and new approaches for ensuring safe per-stripe settings given that failure rates of different devices change over time. Evaluation with logs from real-world clusters show that Tiger provides better space-savings, less bursty IO for changing redundancy schemes, and better robustness (due to increased risk-diversity) than prior disk-adaptive redundancy designs. View details
    Reliability of nand-Based SSDs: What Field Studies Tell Us
    Bianca Schroeder
    Raghav Lagisetty
    Proceedings of the IEEE (2017)
    Preview abstract Solid-state drives (SSDs) based on NAND flash are making deep inroads into data centers as well as the consumer market. In 2016, manufacturers shipped more than 130 million units totaling around 50 Exabytes of storage capacity. As the amount of data stored on solid state drives keeps increasing, it is important to understand the reliability characteristics of these devices. For a long time, our knowledge about flash reliability was derived from controlled experiments in lab environments under synthetic workloads, often using methods for accelerated testing. However, within the last two years, three large-scale field studies have been published that report on the failure behavior of flash devices in production environments subjected to real workloads and operating conditions. The goal of this paper is to provide an overview of what we have learned about flash reliability in production, and where appropriate contrasting it with prior studies performing controlled experiments. View details
    Preview abstract The use of solid state drives based on NAND flash technology is continuously growing. As more data either lives on flash or is being cached on flash, data durability and availability critically depend on flash reliability. This paper provides a detailed field study of flash reliability based on data collected over 6 years in a large-scale data center production environment. The data spans many millions of drive days, ten different drive models, different flash technologies (MLC and SLC) and feature sizes (ranging from 24nm to 50nm). The paper analyses this data in order to derive a better understanding of flash reliability in the field, including the most prevalent types of errors and hardware failures and their frequency, and how different factors impact flash reliability. View details
    Slicer: Auto-Sharding for Datacenter Applications
    Atul Adya
    Jon Howell
    Jeremy Elson
    Colin Meek
    Vishesh Khemani
    Stefan Fulger
    Pan Gu
    Lakshminath Bhuvanagiri
    Jason Hunter
    Roberto Peon
    Alexander Shraer
    Kfir Lev-Ari
    OSDI 2016 (2016)
    Preview abstract Sharding is a fundamental building block of large-scale applications, but most have their own custom, ad-hoc implementations. Our goal is to make sharding as easily reusable as a filesystem or lock manager. Slicer is \Google's general purpose sharding service. It monitors signals such as load hotspots and server health and dynamically shards work over a set of servers. Its goals are to maintain high availability and reduce load imbalance while minimizing churn from moved work. In this paper, we describe Slicer's design and implementation. Slicer has the consistency and global optimization of a centralized sharder while approaching the high availability, scalability, and low latency of systems that make local decisions. It achieves this by separating concerns: a reliable data plane forwards requests, and a smart control plane makes load-balancing decisions off the critical path. Slicer's small but powerful API has proven useful and easy to adopt in dozens of \Google applications. It is used to allocate resources for web service front-ends, coalesce writes to increase storage bandwidth, and increase the efficiency of a web cache. It currently handles 2-6M~req/s of production traffic. Production workloads using Slicer exhibit a most-loaded task 30\%--180\% of the mean load, even for highly skewed and time-varying loads. View details
    Poster Paper: Automatic Reconfiguration of Distributed Storage
    Artyom Sharov
    Alexander Shraer
    Murray Stokely
    The 12th International Conference on Autonomic Computing, IEEE (2015), pp. 133-134
    Preview abstract The configuration of a distributed storage system with multiple data replicas typically includes the set of servers and their roles in the replication protocol. The configuration can usually be changed manually, but in most cases, system administrators have to determine a good configuration by trial and error. We describe a new workload-driven optimization framework that dynamically determines the optimal configuration at run time. Applying the framework to a large-scale distributed storage system used internally in Google resulted in halving the operation latency in 17% of the tested databases, and reducing it by more than 90% in some cases. View details
    Take me to your leader! Online Optimization of Distributed Storage Configurations
    Artyom Sharov
    Alexander Shraer
    Murray Stokely
    Proceedings of the 41st International Conference on Very Large Data Bases, VLDB Endowment (2015), pp. 1490-1501
    Preview abstract The configuration of a distributed storage system typically includes, among other parameters, the set of servers and their roles in the replication protocol. Although mechanisms for changing the configuration at runtime exist, it is usually left to system administrators to manually determine the “best” configuration and periodically reconfigure the system, often by trial and error. This paper describes a new workload-driven optimization framework that dynamically determines the optimal configuration at runtime. We focus on optimizing leader and quorum based replication schemes and divide the framework into three optimization tiers, dynamically optimizing different configuration aspects: 1) leader placement, 2) roles of different servers in the replication protocol, and 3) replica locations. We showcase our optimization framework by applying it to a large-scale distributed storage system used internally in Google and demonstrate that most client applications significantly benefit from using our framework, reducing average operation latency by up to 94%. View details
    Janus: Optimal Flash Provisioning for Cloud Storage Workloads
    Murray Stokely
    Muhammad Waliji
    Francois Labelle
    Xudong Shi
    Eric Schrock
    Proceedings of the USENIX Annual Technical Conference, USENIX, Advanced Computing System Association, 2560 Ninth Street, Suite 215, Berkeley, CA 94710, USA (2013), pp. 91-102
    Preview abstract Janus is a system for partitioning the flash storage tier between workloads in a cloud-scale distributed file system with two tiers, flash storage and disk. The file system stores newly created files in the flash tier and moves them to the disk tier using either a First-In-First-Out (FIFO) policy or a Least-Recently-Used (LRU) policy, subject to per-workload allocations. Janus constructs compact metrics of the cacheability of the different workloads, using sampled distributed traces because of the large scale of the system. From these metrics, we formulate and solve an optimization problem to determine the flash allocation to workloads that maximizes the total reads sent to the flash tier, subject to operator-set priorities and bounds on flash write rates. Using measurements from production workloads in multiple data centers using these recommendations, as well as traces of other production workloads, we show that the resulting allocation improves the flash hit rate by 47–76% compared to a unified tier shared by all workloads. Based on these results and an analysis of several thousand production workloads, we conclude that flash storage is a cost-effective complement to disks in data centers. View details
    Projecting Disk Usage Based on Historical Trends in a Cloud Environment
    Murray Stokely
    Amaan Mehrabian
    Francois Labelle
    ScienceCloud 2012 Proceedings of the 3rd International Workshop on Scientific Cloud Computing, ACM, pp. 63-70
    Preview abstract Provisioning scarce resources among competing users and jobs remains one of the primary challenges of operating large-scale, distributed computing environments. Distributed storage systems, in particular, typically rely on hard operator-set quotas to control disk allocation and enforce isolation for space and I/O bandwidth among disparate users. However, users and operators are very poor at predicting future requirements and, as a result, tend to over-provision grossly. For three years, we collected detailed usage information for data stored in distributed filesystems in a large private cloud spanning dozens of clusters on multiple continents. Specifically, we measured the disk space usage, I/O rate, and age of stored data for thousands of different engineering users and teams. We find that although the individual timeseries often have non-stable usage trends, regional aggregations, user classification, and ensemble forecasting methods can be combined to provide a more accurate prediction of future use for the majority of users. We applied this methodology for the storage users in one geographic region and back-tested these techniques over the past three years to compare our forecasts against actual usage. We find that by classifying a small subset of users with unforecastable trend changes due to known product launches, we can generate three-month out forecasts with mean absolute errors of less than ~12%. This compares favorably to the amount of allocated but unused quota that is generally wasted with manual operator-set quotas. View details