Failed disk recovery in double erasure RAID arrays

Kaushik Srinivasan; Charles Colbourn

doi:10.1016/j.jda.2006.03.003

Failed disk recovery in double erasure RAID arrays

Kaushik Srinivasan, Charles Colbourn

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Reliability is a major concern in the design of large disk arrays. In this paper, we examine the effect of encountering more failures than that for which the RAID array was initially designed. Erasure codes are incorporated to enable system recovery from a specified number of disk erasures, and strive beyond that threshold to recover the system as frequently, and as thoroughly, as is possible. Erasure codes for tolerating two disk failures are examined. For these double erasure codes, we establish a correspondence between system operation and acyclicity of its graph model. For the most compact double erasure code, the full 2-code, this underlies an efficient algorithm for the computation of system operation probability (all disks operating or recoverable). When the system has failed, some disks are nonetheless recoverable. We extend the graph model to determine the probability that d disks have failed, a of which are recoverable by solving one linear equation, b of which are further recoverable by solving systems of linear equations, and d - a - b of which cannot be recovered. These statistics are efficiently calculated for the full 2-code by developing a three variable ordinary generating function whose coefficients give the specified values. Finally, examples are given to illustrate the probability that an individual disk can be recovered, even when the system is in a failed state.

Original language	English (US)
Pages (from-to)	115-128
Number of pages	14
Journal	Journal of Discrete Algorithms
Volume	5
Issue number	1
DOIs	https://doi.org/10.1016/j.jda.2006.03.003
State	Published - Mar 2007

Keywords

Disk architecture
Erasure resilient code
Error-correcting code
Generating function
Parity check matrix

ASJC Scopus subject areas

Theoretical Computer Science
Discrete Mathematics and Combinatorics
Computational Theory and Mathematics

Access to Document

10.1016/j.jda.2006.03.003

Cite this

@article{3f62b647fa964495b0eaaf0294a5ae6b,

title = "Failed disk recovery in double erasure RAID arrays",

abstract = "Reliability is a major concern in the design of large disk arrays. In this paper, we examine the effect of encountering more failures than that for which the RAID array was initially designed. Erasure codes are incorporated to enable system recovery from a specified number of disk erasures, and strive beyond that threshold to recover the system as frequently, and as thoroughly, as is possible. Erasure codes for tolerating two disk failures are examined. For these double erasure codes, we establish a correspondence between system operation and acyclicity of its graph model. For the most compact double erasure code, the full 2-code, this underlies an efficient algorithm for the computation of system operation probability (all disks operating or recoverable). When the system has failed, some disks are nonetheless recoverable. We extend the graph model to determine the probability that d disks have failed, a of which are recoverable by solving one linear equation, b of which are further recoverable by solving systems of linear equations, and d - a - b of which cannot be recovered. These statistics are efficiently calculated for the full 2-code by developing a three variable ordinary generating function whose coefficients give the specified values. Finally, examples are given to illustrate the probability that an individual disk can be recovered, even when the system is in a failed state.",

keywords = "Disk architecture, Erasure resilient code, Error-correcting code, Generating function, Parity check matrix",

author = "Kaushik Srinivasan and Charles Colbourn",

note = "Funding Information: The authors thank three referees whose helpful comments improved the presentation. Research of the authors is supported by the Army Research Office (USA) under grant number DAAD19-01-1-0406 (Colbourn).",

year = "2007",

month = mar,

doi = "10.1016/j.jda.2006.03.003",

language = "English (US)",

volume = "5",

pages = "115--128",

journal = "Journal of Discrete Algorithms",

issn = "1570-8667",

publisher = "Elsevier",

number = "1",

}

TY - JOUR

T1 - Failed disk recovery in double erasure RAID arrays

AU - Srinivasan, Kaushik

AU - Colbourn, Charles

N1 - Funding Information: The authors thank three referees whose helpful comments improved the presentation. Research of the authors is supported by the Army Research Office (USA) under grant number DAAD19-01-1-0406 (Colbourn).

PY - 2007/3

Y1 - 2007/3

N2 - Reliability is a major concern in the design of large disk arrays. In this paper, we examine the effect of encountering more failures than that for which the RAID array was initially designed. Erasure codes are incorporated to enable system recovery from a specified number of disk erasures, and strive beyond that threshold to recover the system as frequently, and as thoroughly, as is possible. Erasure codes for tolerating two disk failures are examined. For these double erasure codes, we establish a correspondence between system operation and acyclicity of its graph model. For the most compact double erasure code, the full 2-code, this underlies an efficient algorithm for the computation of system operation probability (all disks operating or recoverable). When the system has failed, some disks are nonetheless recoverable. We extend the graph model to determine the probability that d disks have failed, a of which are recoverable by solving one linear equation, b of which are further recoverable by solving systems of linear equations, and d - a - b of which cannot be recovered. These statistics are efficiently calculated for the full 2-code by developing a three variable ordinary generating function whose coefficients give the specified values. Finally, examples are given to illustrate the probability that an individual disk can be recovered, even when the system is in a failed state.

AB - Reliability is a major concern in the design of large disk arrays. In this paper, we examine the effect of encountering more failures than that for which the RAID array was initially designed. Erasure codes are incorporated to enable system recovery from a specified number of disk erasures, and strive beyond that threshold to recover the system as frequently, and as thoroughly, as is possible. Erasure codes for tolerating two disk failures are examined. For these double erasure codes, we establish a correspondence between system operation and acyclicity of its graph model. For the most compact double erasure code, the full 2-code, this underlies an efficient algorithm for the computation of system operation probability (all disks operating or recoverable). When the system has failed, some disks are nonetheless recoverable. We extend the graph model to determine the probability that d disks have failed, a of which are recoverable by solving one linear equation, b of which are further recoverable by solving systems of linear equations, and d - a - b of which cannot be recovered. These statistics are efficiently calculated for the full 2-code by developing a three variable ordinary generating function whose coefficients give the specified values. Finally, examples are given to illustrate the probability that an individual disk can be recovered, even when the system is in a failed state.

KW - Disk architecture

KW - Erasure resilient code

KW - Error-correcting code

KW - Generating function

KW - Parity check matrix

UR - http://www.scopus.com/inward/record.url?scp=33947610919&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33947610919&partnerID=8YFLogxK

U2 - 10.1016/j.jda.2006.03.003

DO - 10.1016/j.jda.2006.03.003

M3 - Article

AN - SCOPUS:33947610919

SN - 1570-8667

VL - 5

SP - 115

EP - 128

JO - Journal of Discrete Algorithms

JF - Journal of Discrete Algorithms

IS - 1

ER -

Failed disk recovery in double erasure RAID arrays

Abstract

Keywords

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this