Carbink: Fault-tolerant Far Memory

Yang Zhou
Sihang Liu
Jiaqi Gao
James Mickens
Minlan Yu
Hank Levy
Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, Usenix (2022)

Abstract

Memory-intensive applications would benefit from using available memory from other machines (ie, remote memory or far memory). However, there is a missing piece in recent far memory proposals -- cost-efficient fault tolerance for far memory. In this paper, we motivate the strong need for fault tolerance for far memory using machine/task failure statistics from a major internet service provider. Then we describe the design and implementation off a Fault-Tolerant application-integrated Far Memory (i.e., FTFM) framework. We compare several candidate fault tolerance schemes, and discuss their pros and cons. Finally, we test FTFM using several X-internal applications, including graph processing, globally-distributed database, and in-memory database. Our results show that FTFM has little impact on application performance (~x.x%), while achieving xx% performance of running applications purely in local memory.