Princeton University

School of Engineering & Applied Science

Erasure Codes for Optimal Repairs in Distributed Storage Systems

Sreechakra Goparaju
Engineering Quadrangle J401
Monday, October 20, 2014 - 3:00pm to 4:30pm

Google, Amazon, and other services store data in multiple geographically separated disks called nodes, among other reasons, to safeguard the data from node failures. A standard technique for such a distributed way of storage is to store multiple backups (typically each file is replicated thrice). This especially makes sense in light of a Moore’s law-esque drop in the data storage cost in the last few decades. However, between a simultaneous exponential growth in data itself — by 2020, the digital universe is estimated to contain nearly as many digital bits as there are stars in the universe — and the overhead costs of maintaining the cloud of storage, the efficacy of wide-scale replication increasingly becomes questionable. 
To protect a file against data loss, could one perhaps store only as much additional data as the amount of data loss that one expects in the worst case? (Notice, for example, that a file consisting of two symbols needs to store four extra symbols, that is by replicating twice, if it is to be protected against the loss of any two symbols.) Erasure codes such as Reed-Solomon codes together with information dispersal algorithms enable this, potentially saving the cloud storage industry millions of dollars, were it not for one phenomenon: storage nodes fail often but largely only fail one at a time. Why is this a stumbling block? A policy debate ensued to decide whether replication or erasure coding is the optimal choice for distributed storage. The crux of this debate is the problem of tackling the aforementioned single node failures to maintain the worst-case resilience of a system, abbreviated in this dissertation as the repair problem.
This dissertation looks at two erasure codes to approach the repair problem: regenerating codes, which optimize the communication costs, and locally repairable code​s, which optimize the I/O costs; and presents new code constructions for each class of codes. Ideas developed here ​can also​ be applied to establish the optimal file size that can be securely stored in the presence of an eavesdropper.