Permutation Codes: Optimal Exact-Repair of a Single Failed Node in MDS Code Based Distributed Storage Systems

Viveck R. Cadambe

Cheng Huang, Jin Li

Electrical Engineering and Computer Science, University of California Irvine, Irvine, California, 92697, USA. Email: [email protected]

Communications, Collaborations and Systems Group, Microsoft Research, Redmond, WA, USA. Email: {chengh, jinl}@microsoft.com

Abstract: It is well known that maximum distance separable (MDS) codes are an efficient means of storing data in distributed storage systems, since they provide optimal tolerance against failures (erasures) for a fixed amount of storage. An (n, k) code can be used in a distributed storage system with n disks, each having a storage capacity of 1 unit, to store k units of information, where n > k. If the code used is maximum distance separable, then the storage system can tolerate up to (n − k) disk failures (erasures). The focus of this paper is the design of an MDS code with the additional property that a single disk failure can be repaired with optimal repair bandwidth, i.e., with the smallest possible amount of data downloaded for the recovery of the failed node. Previously, a lower bound of (n − 1)/(n − k) units on the repair bandwidth for a single node failure in an (n, k) MDS code based storage system, where each of the n disks stores 1 unit, was established by Dimakis et al. While that reference established a lower bound on the repair bandwidth, the existence of codes achieving this lower bound for arbitrary (n, k) had only been established in an asymptotic manner, in the limit of arbitrarily large storage capacities. While finite code constructions achieving this lower bound have been provided previously for k ≤ max(n/2, 3), the question of the existence of finite codes achieving the lower bound for arbitrary values of (n, k) remained open. In this paper, we provide the first known construction of a finite code for arbitrary (n, k) which can repair a single failed systematic node by downloading exactly (n − 1)/(n − k) units of data. The codes, which are optimally efficient in terms of a metric known as repair bandwidth, are based on permutation matrices and are hence termed Permutation Codes. We also show that our code has a simple repair property which enables efficiency not only in terms of the amount of repair bandwidth, but also in terms of the amount of data accessed on the disk.

I. INTRODUCTION

Consider a distributed storage system with n distributed data disks, each storing one unit of data. The data storage devices in the storage system can possibly fail. To protect the storage system from information loss in case of disk failure, the amount of storage space in the system is greater than the amount of information stored, with the extra storage space used to design redundancy into the system. Assume that the amount of information to be stored in this storage system is equal to k units, where k < n. Then, it is well known that the optimal tolerance to failures is provided by using an (n, k) maximum distance separable (MDS) code to store the data. Such a code can tolerate any (n − k) disk failures, since the MDS property ensures that the original information can be recovered from any k surviving disks. While codes which are not maximum distance separable (such as repetition codes) can also be used in the storage system, they tolerate fewer failures for a given amount of storage. Equivalently, for a given amount of redundancy, MDS codes provide the smallest cost of storage in distributed data storage systems. There is an enormous amount of literature associated with the search for erasure codes with efficient encoding and decoding properties for storage systems (see, for example, [1]–[4]).

While an MDS code based storage system can tolerate a worst-case failure scenario of (n − k) disks, the most common failure scenario in a storage system is the case where a single disk fails. In case of a single disk failure, efficient (fast) recovery of the failed disk is important, since replacing the disk before other disks fail reduces the chance of data loss and improves the overall reliability of the system. The focus of this paper is on the recovery efficiency (speed) of a single disk failure in an MDS code based distributed storage system. When a single node fails, a new node enters the storage system, connects to the surviving n − 1 disks via a network, downloads data from these (n − 1) surviving disks, and reconstructs the failed node. The primary factor in determining the speed of recovery of a failed node is the amount of time taken for the new node to download the necessary data from the surviving disks, which, in turn, depends on the amount of data accessed and downloaded¹ by the new node. This problem has been studied previously from the perspective of the amount of data to be downloaded, also known as the repair bandwidth, in [5]–[11]. Note that a trivial solution for any (n, k) MDS code is to achieve a repair bandwidth of k units. This is because the entire original data, and hence the failed disk, can be recovered with the new node reading any set of k surviving disks completely. The goal of our paper, as is the goal of references [5]–[11], is to minimize the repair bandwidth needed to recover a single failed node in an MDS code based distributed storage system.

We restrict our code design to systematic MDS codes, i.e., codes where the first k storage nodes, also known as systematic nodes, store an uncoded copy of the k information units. The remaining n − k nodes are referred to as parity nodes. The systematic property is a crucial practical constraint in storage systems, since it ensures ease of access to the data under normal operation (i.e., no failures).
Further, unlike in [5], we require the new node to be a replica of the failed node (also termed exact regeneration in [6]), so that the original systematic code structure is retained after the repair operation is completed. The codes for the scenario we focus on in this paper have also been termed minimum storage regenerating (MSR) codes in the literature. We next summarize the results of references [5]–[11] before proceeding to describe our contributions to this problem.

A. Previous Work

The goal of minimizing the repair bandwidth was initiated in [5]. The reference studied the problem from the perspective of functional repair, where the new (replaced) node need not be a replica of the failed node; it only needs to be information equivalent to the failed node, so that, in combination with the n − 1 surviving nodes, the code still maintains the MDS property. This study showed that, under these constraints, the minimum repair bandwidth is equal to (n − 1)/(n − k) units, where, as mentioned before, each of the n nodes stores 1 unit. Since (n − 1)/(n − k) ≤ k for all n > k > 1, the result meant that, under the constraints of the functional repair problem, the trivial strategy of downloading k units could be improved for any (n, k). Further, since the constraints of functional repair [5] are weaker than the constraints of our problem, where the code has to be systematic and recovery has to be exact, the result implied that for any MDS code based system,

Repair Bandwidth ≥ (n − 1)/(n − k),

i.e., (n − 1)/(n − k) is a lower bound for the problem of exact regeneration which is of interest in this paper. An important question is whether this lower bound is tight even when the stricter and more practical constraints of systematic coding and exact regeneration are imposed. This question was motivated in reference [6] for the special case of (n = 4, k = 2), for which it was shown that the lower bound of (n − 1)/(n − k) = 3/2 units is indeed tight. The reference made a connection between the repair bandwidth problem and the technique of interference alignment, which was developed in the context of wireless communications [12]–[14]. This question was later studied for more general cases in [7]–[11], [15]. By expanding the connections of the repair bandwidth problem to interference alignment, references [7], [8] provided scalar linear code constructions to show that the bound of (n − 1)/(n − k) is tight for k ≤ max(n/2, 3). In other words, for k ≤ max(n/2, 3), perhaps surprisingly, there is no loss in terms of minimum repair bandwidth even if we impose the stricter constraints of exact regeneration and systematic coding. References [11], [15] provided alternate linear code constructions for the case of k ≤ n/2 achieving the lower bound of (n − 1)/(n − k) units.

While the achievability of (n − 1)/(n − k) units for the range of high redundancy (i.e., k ≤ n/2) is a powerful result, most practical distributed storage systems do not operate in this regime. In fact, in most distributed storage systems the number of parity nodes is smaller than the number of systematic nodes, i.e., k > n/2. However, this practical setting is more challenging, and the results in the current literature are relatively weaker. References [9], [10] constructed, for any (n, k) including the cases where k > n/2, a class of codes that achieve a repair bandwidth of (n − 1)/(n − k) asymptotically, as the amount of information stored in a single node grows arbitrarily large². In other words, they showed that exact regeneration is asymptotically equivalent to functional regeneration. While this asymptotic equivalence is an interesting theoretical limit on what practical codes can achieve, the existence of finite codes achieving a repair bandwidth of (n − 1)/(n − k) units remained an open problem of practical interest. In fact, for arbitrary (n, k), even the construction of finite codes having a repair strategy more efficient than the trivial repair strategy with a repair bandwidth of k units remained open. The only other insight in previous literature regarding this open problem comes from [7], which restricts coding schemes to scalar linear codes and shows that, under this restriction, the limit of (n − 1)/(n − k) units on the repair bandwidth is NOT achievable. It is this open problem, the existence of finite codes achieving the lower bound of (n − 1)/(n − k) units on the repair bandwidth, that is the focus of this paper. The main contribution of this paper is the achievability of this limit using finite-length vector linear codes, for the failure of systematic nodes. We explore our main contributions in greater detail in the next section.

Before we proceed, we note that there exists, in the literature, a parallel line of work which studies the repair bandwidth for codes that are not necessarily MDS and hence use a greater amount of storage for a given amount of redundancy [5], [7], [16]. These references study the tradeoff between the amount of storage and the repair bandwidth required, for a given amount of redundancy. Further, we also note that the design of codes from the perspective of efficient recovery of their information elements under error correction (rather than erasure correction) has also been studied in the literature in association with locally decodable codes (see [17] and references therein).

Footnote 1: There is a subtle difference between the amount of data accessed and the amount downloaded; such differences are explored later in Section II-A.
The focus of this paper, however, is on MDS erasure codes (also referred to as minimum storage regenerating codes), i.e., (n, k) codes which can tolerate any (n − k) erasures.

II. SUMMARY OF CONTRIBUTIONS

The main contribution of this paper is the design of a new class of MDS codes which achieve the minimum repair bandwidth of (n − 1)/(n − k) units for the repair of a single failed systematic node. Specifically, the code constructions presented in this paper are listed below.
1) Random Permutation Codes: In Section V, we present a construction of codes which achieve the repair bandwidth lower bound of (n − 1)/(n − k) units for the repair of systematic nodes, for any tuple (n, k) where n > k. The code construction, albeit finite, is based on random coding, with the random coding argument used to justify the existence of a repair-bandwidth optimal MDS code. This means that for any arbitrary (n, k), a brute-force search over a (finite) set of codes described in the section will yield a repair-bandwidth optimal MDS code. Since our code construction is based on permutation matrices, the codes are termed Permutation Codes.
2) Explicit Construction of Permutation Codes for n − k ∈ {2, 3}: While Section V describes a random coding based construction, we also provide, in Section VI, explicit constructions for the special case of n − k ∈ {2, 3}.
It must be noted that the search for codes which efficiently repair both systematic and parity nodes is still open. However, from a practical perspective, the step taken in this paper is important since, in most storage systems, the number of parity nodes is small compared to the number of systematic nodes.

Footnote 2: This means that a unit must be equivalent to an arbitrarily large number of bits.
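As a quick numeric illustration of the target (our own, simply evaluating the bound above), the following Python lines compare the repair-bandwidth lower bound (n − 1)/(n − k) with the trivial strategy of downloading k full nodes, for a few example parameters.

```python
# Repair-bandwidth lower bound vs. the trivial "download k nodes" strategy.
for n, k in [(4, 2), (10, 8), (14, 10)]:
    bound = (n - 1) / (n - k)
    print(f"(n, k) = ({n}, {k}): optimal repair downloads {bound:.2f} units, trivial repair downloads {k} units")
```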

A. Efficient Code Construction in terms of Disk Access

While most previous works described above explore the repair problem by accounting for the amount of information to be sent over the network for repair, there exists another important cost during the repair of a node, viz., the amount of disk access. To understand the difference between these two costs, consider a toy example where a disk stores two bits a1, a2. Now, suppose that, to repair some other failed node in the system, the bit a1 + a2 has to be sent to the new node. This means that the repair bandwidth contributed by this particular disk is 1 bit. However, in many storage systems, the disk-read speed is slower than the network transfer speed and hence becomes a bottleneck. In the case where the disk read speed is the bottleneck, the defining factor in the speed of repair is the amount of disk access rather than the repair bandwidth. In the toy example described above, the amount of disk access is 2 bits, since both a1 and a2 have to be read from the disk to compute a1 + a2. Thus, it is possible that certain codes, while minimizing repair bandwidth, perform poorly in terms of disk access, rendering the codes impractical. In this paper, we formalize this notion of disk access cost, and show that Permutation Codes are not only bandwidth optimal, but also disk-access optimal, for the repair of a single failed systematic node.

III. THE SIMPLEST PERMUTATION CODE: n = 4, k = 2

We start with the case of n = 4, k = 2. Repair-bandwidth optimal codes for the case of n = 4, k = 2 have, in fact, been developed previously in [6]. We only use this case as a toy example to bring out the interference alignment perspective of the challenges of code development. The k = 2 sources are denoted as a1 and a2 respectively, with ai = [ai(1) ai(2)]^T being a 2 × 1 vector over a field of size q ≥ 5. The goal is to design a code of n = 4 elements, where each of the four code elements is a 2 × 1 vector. The ith code element is denoted as di for i = 1, 2, 3, 4. Since each storage/code element is equivalent to 2 scalars over the field, we consider 1 unit (the amount of storage in each disk) to be equivalent to 2 scalars over the field. The goal is to design an MDS code which repairs a single failed systematic node by downloading a total of 3/2 units, or equivalently 3 scalars over the field. In particular, we present a code which repairs a failed systematic node by downloading a single scalar from each of the 3 surviving nodes. Since we restrict ourselves to systematic codes, we have d1 = a1 and d2 = a2. Let the parity nodes i = 3, 4 store vectors of the form di = Ci,1 a1 + Ci,2 a2, where C3,1, C3,2, C4,1, C4,2 are all 2 × 2 matrices over the field. As depicted in Figure 1, we have

C3,1 = C3,2 = C4,2 = [ 1 0 ; 0 1 ],   C4,1 = [ 0 1 ; 2 0 ].

It can be verified that the above code is an MDS code, since a1, a2 can be recovered from any 2 nodes. Now, suppose node 1 fails. A naive strategy would be to download the 4 scalars stored in any 2 surviving nodes completely, and recover all of a1 and a2. In this naive strategy, while we only desire to reconstruct a1, we also end up completely reconstructing the variable a2, although it is not desired. Thus, this strategy uses 4 dimensions to download 4 variables: two desired variables and two undesired variables. However, as depicted in Figure 1, a more efficient strategy, which downloads a total of 3 scalars (one scalar from each surviving node), exists.
This more efficient strategy achieves this by restricting the undesired variable (or interference) a2 to only 1 dimension. The idea is to download the scalar V1 di from node i for i = 2, 3, 4, where V1 = [1 0]. Then, since

rowspan(V1) = rowspan(V1 C3,2) = rowspan(V1 C4,2) = rowspan([1 0]),    (1)

the interference associated with a2 aligns along the scalar V1 a2 = a2(1). This variable, downloaded from node 2 as V1 d2 = a2(1), is cancelled to obtain 2 equations in the two desired variables a1(1) and a1(2). Because

rank [ V1 C3,1 ; V1 C4,1 ] = 2,    (2)

the desired variables can be recovered and the node repaired. Similarly, if node 2 fails, it can be noted that downloading V2 di, where V2 = [0 1], from nodes i ∈ {1, 3}, and V1 d4 from node 4, suffices for repair because relations (3) and (4) below are satisfied.

[Figure 1: An optimal (4, 2) code. Nodes 1 and 2 store (a1(1), a1(2)) and (a2(1), a2(2)); node 3 stores (a1(1) + a2(1), a1(2) + a2(2)); node 4 stores (a1(2) + a2(1), 2a1(1) + a2(2)). To repair node 1, the new node downloads a2(1), a1(1) + a2(1) and a1(2) + a2(1), with the interference from a2 aligned along a2(1).]

rowspan(V2 C3,1) = rowspan(V1 C4,1) = rowspan(V2),    (3)

rank [ V2 C3,2 ; V1 C4,2 ] = 2.    (4)
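As a concrete sanity check of the (4, 2) code above, the following self-contained Python sketch (our own illustration; it fixes the field to GF(5), which satisfies the requirement q ≥ 5) verifies the MDS property and carries out the 3-scalar repair of node 1.

```python
# Verify the (4,2) code of this section over GF(5): MDS property and 3-scalar repair of node 1.
import itertools, random

q = 5
I2 = [[1, 0], [0, 1]]
Z2 = [[0, 0], [0, 0]]
C = {  # C[(node, source)]; nodes 1..4, sources 1..2
    (1, 1): I2, (1, 2): Z2,
    (2, 1): Z2, (2, 2): I2,
    (3, 1): I2, (3, 2): I2,
    (4, 1): [[0, 1], [2, 0]], (4, 2): I2,
}

def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(2)) % q for r in range(2)]

def det4(M):  # Leibniz formula for a 4x4 determinant, reduced mod q
    total = 0
    for perm in itertools.permutations(range(4)):
        sign = 1
        for i in range(4):
            for j in range(i + 1, 4):
                if perm[i] > perm[j]:
                    sign = -sign
        prod = sign
        for i in range(4):
            prod *= M[i][perm[i]]
        total += prod
    return total % q

# MDS check: the data (a1, a2) must be recoverable from any 2 nodes,
# i.e. every stacked 4x4 matrix [C_{n1,1} C_{n1,2}; C_{n2,1} C_{n2,2}] is invertible mod q.
for n1, n2 in itertools.combinations(range(1, 5), 2):
    M = [C[(n1, 1)][r] + C[(n1, 2)][r] for r in range(2)] + \
        [C[(n2, 1)][r] + C[(n2, 2)][r] for r in range(2)]
    assert det4(M) != 0, (n1, n2)

# Repair of node 1: download only the first symbol of d2, d3, d4 (3 scalars in total).
a1 = [random.randrange(q) for _ in range(2)]
a2 = [random.randrange(q) for _ in range(2)]
d = {i: [(x + y) % q for x, y in zip(matvec(C[(i, 1)], a1), matvec(C[(i, 2)], a2))] for i in range(1, 5)}
y2, y3, y4 = d[2][0], d[3][0], d[4][0]
a1_rec = [(y3 - y2) % q, (y4 - y2) % q]  # cancel the aligned interference a2(1), then read off a1
assert a1_rec == a1
print("(4,2) code: MDS property and 3-scalar repair of node 1 verified")
```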

Note that the coding matrices in the above construction are permutation matrices. In the context of interference channels, equations (1) and (3) are similar to the alignment constraints, and equations (2) and (4) are related to the reconstruction of the desired signal (see [14]). The coding matrices Ci,j are analogous to the channel gain matrices in wireless interference channels, and the vectors V1 and V2 are analogous to the beamforming vectors in the wireless context. Note that there is a fundamental difference between the wireless interference channel and the storage setting: in the storage setting, we are allowed to design the channel matrices to satisfy the desired alignment relations, since we are allowed to design the coding matrices. While this flexibility enables a simple solution for the case of n = 4, k = 2, the extensions of this solution to more general cases of (n, k) are not trivial. In fact, it is shown in [7] that the extensions are not even possible if we restrict our code elements di to be (n − k) × 1 vectors. As we will show in this paper, such extensions are in fact possible, by choosing the code elements di to be L = (n − k)^k dimensional vectors. We next introduce general notation and an overview of our solution.

IV. OVERVIEW OF THE SOLUTION FOR ARBITRARY (n, k)

Consider k sources, all of equal size L = M/k, over a field Fq of size q. Source i ∈ {1, 2, . . . , k} is represented by the L × 1 vector ai ∈ Fq^L. Note here that M denotes the size of the total information stored in the distributed storage system, in terms of the number of elements over the field. There are n nodes storing a codeword of the k source symbols in an (n, k) MDS code. Each node stores data of size L, i.e., each coded symbol of the (n, k) code is an L × 1 vector. Therefore, 1 unit is equivalent to L scalars over the field Fq. The data stored in node i is represented by the L × 1 vector di, where i = 1, 2, . . . , n. We assume that our code is linear, so that di can be represented as

di = Σ_{j=1}^{k} Ci,j aj,

where the Ci,j are L × L square matrices. Further, we restrict our codes to have a systematic structure, so that, for i ∈ {1, 2, . . . , k},

Ci,j = I if j = i, and Ci,j = 0 if j ≠ i.

Since we restrict our attention to MDS codes, we will need the matrices Ci,j to satisfy the following property.

Property 1:

rank [ Cj1,1  Cj1,2  . . .  Cj1,k ;  Cj2,1  Cj2,2  . . .  Cj2,k ;  . . . ;  Cjk,1  Cjk,2  . . .  Cjk,k ] = Lk = M    (5)

for any distinct j1, j2, . . . , jk ∈ {1, 2, . . . , n}. The MDS property ensures that the storage system can tolerate up to (n − k) failures (erasures), since all the sources can be reconstructed from any k nodes whose indices are denoted by j1, j2, . . . , jk ∈ {1, 2, . . . , n}.

Now, consider the case where a single systematic node, say node i ∈ {1, 2, . . . , k}, fails. The goal is to reconstruct the failed node i, i.e., to reconstruct di, using all the other n − 1 nodes, i.e., {dj : j ≠ i}. To understand the solution, first consider the case where node 1 fails. We download a fraction 1/(n − k) of the data stored in each of the n − 1 nodes {2, 3, . . . , n}, so that the total repair bandwidth is (n − 1)/(n − k) units. We focus on linear repair solutions for our codes, which implies that we need to download L/(n − k) linear combinations from each of the dj, j ∈ {2, 3, . . . , n}. Specifically, we denote the linear combinations downloaded from node j ∈ {2, 3, . . . , n} as

V1,j dj = V1,j Σ_{i=1}^{k} Cj,i ai = V1,j Cj,1 a1  +  V1,j Σ_{i=2}^{k} Cj,i ai,

where the first term is the desired signal component and the second term is the interference component,

and where V1,j is an L/(n − k) × L dimensional matrix. The matrices V1,j are referred to as repair matrices in this paper. The goal of the problem is to reconstruct the L components of a1 from the above equations. For a surviving systematic node j ∈ {2, 3, . . . , k}, the equations downloaded by the new node do not contain information about the desired signal a1, since for these nodes Cj,1 = 0. The linear combinations downloaded from the remaining nodes j ∈ {k + 1, k + 2, . . . , n}, however, contain components of both the desired signal and the interference. Thus, the downloaded linear combinations V1,j dj are of two types.
1) The data downloaded from the surviving systematic nodes j = 2, . . . , k contains no information about the desired signal a1, i.e., V1,j dj = V1,j aj, j = 2, . . . , k. Note that there are L/(n − k) such linear combinations of each interfering component aj, j = 2, 3, . . . , k.
2) The L components of the desired signal have to be reconstructed using the (n − k) · L/(n − k) = L linear combinations of the form V1,j dj, j = k + 1, k + 2, . . . , n. Note that these linear combinations also contain the interference terms aj, j = 2, . . . , k, which need to be cancelled.
The goal of our solution will be to completely cancel the interference from the second set of L linear combinations using the former set of linear combinations, and then to regenerate a1 using the latter L interference-free linear combinations.

A. Interference Cancellation

The linear combinations corresponding to the interference component ai, i ≠ 1, downloaded by the new node from node i are V1,i ai, for i = 2, 3, . . . , k. To cancel the associated interference from the linear combinations V1,j dj downloaded from the remaining nodes by linear techniques, we will need, for all j = k + 1, k + 2, . . . , n and all i = 2, 3, . . . , k,

rowspan(V1,j Cj,i) ⊆ rowspan(V1,i),

which is equivalent to

rowspan(V1,j Cj,i) = rowspan(V1,i),    (6)

where the equality in (6) follows because the Cj,i are all full rank matrices, so that the subset relation automatically implies the equality relation, since rank(V1,j Cj,i) = rank(V1,j) = L/(n − k) = rank(V1,i). Thus, as long as (6) is satisfied for all values j ∈ {k + 1, k + 2, . . . , n}, i ∈ {2, 3, . . . , k}, the interference components can be completely cancelled from V1,j dj to obtain V1,j Cj,1 a1, j ∈ {k + 1, k + 2, . . . , n}. Now, we need to ensure that the desired L × 1 vector a1 can be uniquely resolved from the L linear combinations of the form V1,j Cj,1 a1, j = k + 1, k + 2, . . . , n. In other words, we need to ensure that

rank [ V1,k+1 Ck+1,1 ; V1,k+2 Ck+2,1 ; . . . ; V1,n Cn,1 ] = L.    (7)

If we construct the matrices Cl,j and V1,i satisfying (6) and (7) for i = 2, . . . , n, j = 1, 2, . . . , k, l = 1, 2, . . . , n, then a failure of node 1 can be repaired with the desired minimum repair bandwidth. To solve the problem for the failure of any other systematic node, we need to ensure similar conditions. We summarize all the conditions required for the successful reconstruction of a single failed (systematic) node with the minimum repair bandwidth below.
• Equation (5) in Property 1.
• The interference alignment relations
  rowspan(Vl,j Cj,i) = rowspan(Vl,i)
  for l = 1, 2, . . . , k, j = k + 1, k + 2, . . . , n and i ∈ {1, 2, . . . , k} − {l}.
• Reconstruction of the failed node, given that the alignment relations are satisfied:
  rank [ Vl,k+1 Ck+1,l ; Vl,k+2 Ck+2,l ; . . . ; Vl,n Cn,l ] = L    (8)
  for l = 1, 2, . . . , k.
Note that, given n and k, our design choices are L, q, the Cj,i, and the Vl,j for l = 1, 2, . . . , k, j = k + 1, k + 2, . . . , n and i ∈ {1, 2, . . . , k} − {l}. Reference [7] has shown that the above conditions can NOT be satisfied if we restrict ourselves to M = k(n − k). References [9], [10] constructed solutions which satisfy the above relations in an asymptotically exact manner, as M → ∞. The main contribution of this paper is the construction of coding matrices and repair matrices so that the above relations are satisfied exactly, with finite M, i.e., with M = k(n − k)^k.

B. Characteristics of Our Solution

Above, we have defined a general structure for a linear, repair-bandwidth optimal solution. For the solution described in this paper, the repair matrices satisfy a set of additional properties, described here. First, in our solution,

Vl,j = Vl,j′



for all l ∈ {1, 2, . . . , k}, j 6= j , j, j ∈ {1, 2, . . . , n} − {l}. In other words, when a node, say node l, fails, we download the same linear combination from every surviving node. We use the notation △

Vl = Vl,j for all j ∈ {1, 2, . . . , n} − {l}. The second property satisfied by our solution, is that it is disk-access optimal. This notion is explained further next. C. Disk-Access Optimality Definition 1: Consider a set of L × L dimensional coding matrices Ci,j , i = k + 1, k + 2, . . . , n, j = 1, 2, . . . , k and a set of repair matrices Wl,i for some l ∈ {1, 2, . . . , n} and for all i ∈ {1, 2, . . . , n} − {l}, where the repair matrix Wl,i has dimension Bl,i × L, where Bl,i ≤ L. The repair matrices satisfy the property that dl can be reconstructed linearly from Wl,i di , i ∈ {1, 2, . . . , n} − {l}. In other words, a failure of node l can be repaired using the repair matrices. Then the amount of disk access required for the repair of node l is defined to be the
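To make the distinction concrete, here is a two-line illustration (ours, mirroring the toy example of Section II-A): a repair matrix of rank 1 can still touch two stored symbols.

```python
# W encodes "send a1 + a2": rank(W) = 1 unit of bandwidth, but 2 non-zero columns = 2 symbols read.
W = [[1, 1]]
bandwidth = 1                                                       # rank of W
disk_access = sum(1 for c in range(2) if any(row[c] for row in W))  # omega(W)
assert (bandwidth, disk_access) == (1, 2)
```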

[Figure 2: The two parity nodes in the (5, 3) Permutation Code; node 4 stores λ4,1 a1 + λ4,2 a2 + λ4,3 a3 componentwise, while node 5 stores the same combination with the components of a1, a2, a3 permuted as described in Section V-A.]

To compute Wl,i di, only ω(Wl,i) entries of the vector di have to be accessed; this leads to the above definition of the amount of disk access for a linear solution. Also, note that rank(Wl,i), the amount of bandwidth downloaded from node i, is always at most ω(Wl,i). Therefore, the amount of disk access is at least as large as the amount of bandwidth used for a given solution. This leads to the following lemma.

Lemma 1: For any (n, k) MDS code storing 1 unit of data in each disk, the amount of disk access needed to repair any single failed node l = 1, 2, . . . , n is at least as large as (n − 1)/(n − k) units.

It turns out that our solution is not only repair bandwidth optimal, but also optimal in terms of disk access. More formally, for our solution, Vl not only has a rank of L/(n − k), it also has exactly L/(n − k) non-zero columns; in fact, Vl has exactly L/(n − k) non-zero entries. Among the L columns of Vl, L − L/(n − k) are zero. This means that, to obtain the linear combination Vl di from node i for the repair of node l ≠ i, only L/(n − k) entries of node i have to be accessed. We now proceed to describe our solution.

V. PERMUTATION CODE

In this section, we describe a set of random codes based on permutation matrices satisfying the desired properties described in the previous section. We begin with some preliminary notation required for our description.

Notation and Preliminary Definitions: Bold font is used for vectors and matrices, and regular font is reserved for scalars. Given an l × 1 dimensional vector a, its l components are denoted by a(1), a(2), . . . , a(l), i.e., a = [a(1) a(2) . . . a(l)]^T. For example, d1 = [d1(1) d1(2) . . . d1(L)]^T. Given a set A, the l-dimensional Cartesian product of the set is denoted by A^l. The notation Il denotes the l × l identity matrix; the subscript l is dropped when the size is clear from the context.

Next, we define a set of functions which will be useful in the description of Permutation Codes. Given (n, k) and a number m ∈ {1, 2, . . . , (n − k)^k}, we define a function³ φ : {1, 2, . . . , (n − k)^k} → {0, 1, . . . , n − k − 1}^k such that φ(m) is the unique k dimensional vector whose k components represent the k-length representation of m − 1 in base (n − k), with r1 as the most significant digit. In other words,

φ(m) = (r1, r2, . . . , rk)  ⇔  m − 1 = Σ_{i=1}^{k} ri (n − k)^(k−i),

where ri ∈ {0, 1, . . . , n − k − 1}. Further, we denote the ith component of φ(m) by φi(m), for i = 1, 2, . . . , k. Since the k-length representation of a number in base (n − k) is unique, φ and the φi are well defined functions. Further, φ is invertible, and its inverse is denoted by φ⁻¹. We also use the following compressed notation for φ⁻¹:

⟨r1, r2, . . . , rk⟩ ≜ φ⁻¹(r1, r2, . . . , rk) = Σ_{i=1}^{k} ri (n − k)^(k−i) + 1.

Footnote 3: While the functions defined here are parametrized by n and k, these quantities are not explicitly denoted here, for brevity of notation.
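The following short Python helpers (our own sketch) implement φ and the compressed inverse ⟨·⟩ exactly as defined above, using the digit-order convention of the (5, 3) example below (so that ⟨0, 0, 0⟩ = 1 and ⟨1, 0, 0⟩ = 5).

```python
# Index helpers for the Permutation Code: phi maps an index to its base-(n-k) digits,
# phi_inv (the <.> notation) maps the digits back to the index.
def phi(m, n, k):
    """Return the k digits (r1, ..., rk) of m-1 written in base (n-k), r1 most significant."""
    b = n - k
    digits, x = [], m - 1
    for _ in range(k):
        digits.append(x % b)
        x //= b
    return tuple(reversed(digits))

def phi_inv(digits, n, k):
    """Inverse map <r1, ..., rk>: base-(n-k) digits back to an index in {1, ..., (n-k)^k}."""
    b = n - k
    m = 0
    for r in digits:
        m = m * b + r
    return m + 1

# For (n, k) = (5, 3): <1, 0, 0> = 5 and phi(5) = (1, 0, 0), matching the example below.
assert phi_inv((1, 0, 0), 5, 3) == 5
assert phi(5, 5, 3) == (1, 0, 0)
```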

The above functions will be useful in constructing our codes.

A. Example: n = 5, k = 3

We motivate our code by first considering the case k = 3, n = 5 for simplicity; the extension of the code to arbitrary (n, k) will follow later⁴. For n = 5, k = 3, we have M/k = (n − k)^k = 2^3 = 8. As the name suggests, we use scaled permutation matrices for the Ci,j, j ∈ {1, 2, . . . , k}, i ∈ {k + 1, k + 2, . . . , n}. Note here that the variables aj, j = 1, 2, . . . , k, are (n − k)^k × 1 dimensional vectors. We index the (n − k)^k = 8 components of these vectors by the k = 3 bit representation of their indices, i.e.,

aj = (aj(1) aj(2) . . . aj(8))^T = [aj(⟨0,0,0⟩) aj(⟨0,0,1⟩) aj(⟨0,1,0⟩) aj(⟨0,1,1⟩) aj(⟨1,0,0⟩) aj(⟨1,0,1⟩) aj(⟨1,1,0⟩) aj(⟨1,1,1⟩)]^T

for all j = 1, 2, . . . , k. Similarly, we can denote the rows of the identity matrix as

I8 = [ e(1); e(2); . . . ; e(8) ] = [ e(⟨0,0,0⟩); e(⟨0,0,1⟩); . . . ; e(⟨1,1,1⟩) ],

where, naturally, e(i) is the ith row of the identity matrix. Now, we describe our code as follows. Since the first three storage nodes are systematic nodes and the remaining two are parity nodes, the design parameters are C4,j, C5,j, Vj for j = 1, 2, 3. We choose C4,j = λ4,j I, so that

d4 = Σ_{j=1}^{3} λ4,j aj,

where the λ4,j are independent random scalars chosen using a uniform distribution over the field Fq.

where λ4,j are independent random scalars chosen using a uniform distribution over the field Fq . Now, consider 4 Optimal

codes for n = 5, k = 3 have been proposed in [8], [18]. We only use this case to demonstrate our construction in a simple setting.

Now, consider the 8 × 8 permutation matrices Pi defined as

P1 = [ e(⟨1,0,0⟩); e(⟨1,0,1⟩); e(⟨1,1,0⟩); e(⟨1,1,1⟩); e(⟨0,0,0⟩); e(⟨0,0,1⟩); e(⟨0,1,0⟩); e(⟨0,1,1⟩) ],
P2 = [ e(⟨0,1,0⟩); e(⟨0,1,1⟩); e(⟨0,0,0⟩); e(⟨0,0,1⟩); e(⟨1,1,0⟩); e(⟨1,1,1⟩); e(⟨1,0,0⟩); e(⟨1,0,1⟩) ],
P3 = [ e(⟨0,0,1⟩); e(⟨0,0,0⟩); e(⟨0,1,1⟩); e(⟨0,1,0⟩); e(⟨1,0,1⟩); e(⟨1,0,0⟩); e(⟨1,1,1⟩); e(⟨1,1,0⟩) ].

Then, the fifth node (i.e., the second parity node) is designed as

d5 = Σ_{j=1}^{3} λ5,j Pj aj,

where the λ5,j are independent random scalars drawn uniformly over the field Fq. In other words, we have C5,j = λ5,j Pj, j = 1, 2, 3. The code is depicted in Figure 2. For a better understanding of the structure of the permutations, consider an arbitrary column vector a = [a(1) a(2) . . . a(8)]^T. Then,

P1 a = [ a(⟨1,0,0⟩); a(⟨1,0,1⟩); a(⟨1,1,0⟩); a(⟨1,1,1⟩); a(⟨0,0,0⟩); a(⟨0,0,1⟩); a(⟨0,1,0⟩); a(⟨0,1,1⟩) ] = [ a(5); a(6); a(7); a(8); a(1); a(2); a(3); a(4) ].

In other words, P1 is a permutation of the components of a such that the element a(⟨1, x2, x3⟩) is swapped with the element a(⟨0, x2, x3⟩) for x2, x3 ∈ {0, 1}. Similarly, P2 swaps a(⟨x1, 0, x3⟩) with a(⟨x1, 1, x3⟩), and P3 swaps a(⟨x1, x2, 0⟩) with a(⟨x1, x2, 1⟩), where x1, x2, x3 ∈ {0, 1}.

Now, we show that this code can be used to achieve optimal recovery, in terms of repair bandwidth, of a single failed systematic node. To see this, consider the case where node 1 fails. Note that for optimal repair, the new node has to download a fraction 1/(n − k) = 1/2 of the contents of every surviving node, i.e., of nodes 2, 3, 4, 5. The repair strategy is to download di(⟨0,0,0⟩), di(⟨0,0,1⟩), di(⟨0,1,0⟩), di(⟨0,1,1⟩) from node i ∈ {2, 3, 4, 5}, so that

V1 = [ e(⟨0,0,0⟩); e(⟨0,0,1⟩); e(⟨0,1,0⟩); e(⟨0,1,1⟩) ] = [ e(1); e(2); e(3); e(4) ].

In other words, the rows of V1 come from the set {e(⟨0, x2, x3⟩) : x2, x3 ∈ {0, 1}}. Note that the strategy downloads half the data stored in every surviving node, as required. With these download vectors, it can be observed (see Figure 3) that the interference is aligned as required and all 8 components of the desired signal a1 can be reconstructed. Specifically, we note that

rowspan(V1 C4,i) = rowspan(V1 C5,i) = span({e(⟨0, x2, x3⟩) : x2, x3 ∈ {0, 1}})    (9)

for i = 2, 3. Put differently, because of the structure of the permutations, the downloaded components can be expressed as

d4(⟨0, x2, x3⟩) = λ4,1 a1(⟨0, x2, x3⟩) + λ4,2 a2(⟨0, x2, x3⟩) + λ4,3 a3(⟨0, x2, x3⟩),
d5(⟨0, x2, x3⟩) = λ5,1 a1(⟨1, x2, x3⟩) + λ5,2 a2(⟨0, x2 ⊕ 1, x3⟩) + λ5,3 a3(⟨0, x2, x3 ⊕ 1⟩).

Note that since x2, x3 ∈ {0, 1}, there are a total of 8 components described in the two equations above.

[Figure 3: Repair of node 1 in the (5, 3) Permutation Code. Shaded portions indicate the downloaded portions used to recover the failure of node 1. Note that the undesired symbols can be cancelled by downloading half the components of a2, a3, i.e., by downloading a2(⟨0, x1, x2⟩) and a3(⟨0, x1, x2⟩) for x1, x2 ∈ {0, 1}.]

All the interference in these equations is of the form ai(⟨0, y2, y3⟩), i ∈ {2, 3}, y2, y3 ∈ {0, 1}. In other words, the interference from ai, i = 2, 3, comes from only half of its components, and the interference is aligned as described in (9). At the same time, the 8 downloaded components span all 8 components of the desired signal a1. Thus, the interference can be completely cancelled and the desired signal can be completely reconstructed. Similarly, in case of failure of node 2, the set of rows of the repair matrix V2 is equal to the set {e(⟨x1, 0, x3⟩) : x1, x3 ∈ {0, 1}}, i.e.,

V2 = [ e(⟨0,0,0⟩); e(⟨0,0,1⟩); e(⟨1,0,0⟩); e(⟨1,0,1⟩) ] = [ e(1); e(2); e(5); e(6) ].

With this set of download vectors, it can be noted that, for i = 1, 3,

rowspan(V2 C4,i) = rowspan(V2 C5,i) = span({e(⟨x1, 0, x3⟩) : x1, x3 ∈ {0, 1}}),    (10)

so that the interference is aligned. It can be verified that the desired signal can be reconstructed completely, since condition (8) is satisfied as well. Finally, the rows of V3 come from the set {e(⟨x1, x2, 0⟩) : x1, x2 ∈ {0, 1}}. Equations (6) and (8) can be verified to be satisfied for this choice of V3 as well, with the alignment condition taking the form, for i = 1, 2,

rowspan(V3 C4,i) = rowspan(V3 C5,i) = span({e(⟨x1, x2, 0⟩) : x1, x2 ∈ {0, 1}}).    (11)
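The alignment relations (9)–(11) and the accompanying reconstruction condition can be checked mechanically. The following Python sketch (ours; it works with row-index sets rather than explicit 8 × 8 matrices) performs that check for the (5, 3) code.

```python
# Check of the alignment relations (9)-(11) for the (5,3) code: V_l consists of the rows e(m)
# with the l-th digit of m equal to 0, and multiplying a row e(m) by C_{5,i} = lambda_{5,i} P_i
# moves it to the row obtained by flipping digit i of m (C_{4,i} leaves the row where it is).
k, b = 3, 2                         # (n, k) = (5, 3), so n - k = 2 and L = 8
L = b ** k

def digit(m, i):                    # i-th digit (1-based, most significant first) of index m in {1..L}
    return ((m - 1) >> (k - i)) & 1

def flip(m, i):                     # index obtained from m by flipping digit i
    return ((m - 1) ^ (1 << (k - i))) + 1

for l in range(1, k + 1):
    V_rows = {m for m in range(1, L + 1) if digit(m, l) == 0}
    for i in range(1, k + 1):
        if i == l:
            continue
        # Alignment: the image of V_rows under C_{5,i} is the same set of rows as under C_{4,i}.
        assert {flip(m, i) for m in V_rows} == V_rows
    # Reconstruction: rows contributed by node 4 (V_rows) and node 5 (their digit-l flips)
    # together cover all L rows, so the stacked matrix in (8) has rank L.
    assert V_rows | {flip(m, l) for m in V_rows} == set(range(1, L + 1))
print("Alignment relations (9)-(11) and the reconstruction condition hold for the (5,3) code")
```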

While this shows that optimal repair is achieved, all that remains to be shown is that the code is an MDS code, i.e., that it satisfies Property 1. This is shown in Appendix I, for the generalization of this code to arbitrary values of (n, k). Next, we describe this generalization.

B. The (n, k) Permutation Code

This is a natural generalization of the (5, 3) code to general values of (n, k), with L = (n − k)^k. To describe this generalization, we define the function

χi(m) = (φ1(m), φ2(m), . . . , φi−1(m), φi(m) ⊕ 1, φi+1(m), φi+2(m), . . . , φk(m)),

where the operator ⊕ represents addition modulo (n − k). In other words, χi(m) modifies the ith position in the base-(n − k) representation of m − 1 by adding 1 modulo (n − k).

Remark 1: For the (5, 3) Permutation Code described previously, note that the mth row of Pi is e(⟨χi(m)⟩). In other words, for the (5, 3) Permutation Code described above, the mth component of Pi a is equal to a(⟨χi(m)⟩).

Remark 2: ⟨χi(1)⟩, ⟨χi(2)⟩, . . . , ⟨χi((n − k)^k)⟩ is a permutation of 1, 2, . . . , (n − k)^k for any i ∈ {1, 2, . . . , k}. Therefore, given an L × 1 vector a,

[ a(⟨χi(1)⟩), a(⟨χi(2)⟩), . . . , a(⟨χi((n − k)^k)⟩) ]^T

is a permutation of a. We will use this permutation to construct our codes. In this code, we have L = M/k = (n − k)^k, so that the k sources a1, a2, . . . , ak are all (n − k)^k × 1 vectors and the coding matrices are (n − k)^k × (n − k)^k matrices. Consider the permutation matrix Pi defined as

Pi = [ e(⟨χi(1)⟩); e(⟨χi(2)⟩); . . . ; e(⟨χi((n − k)^k)⟩) ]    (12)

for i = 1, 2, . . . , k, where e(1), e(2), . . . , e((n − k)^k) are the rows of the identity matrix of size (n − k)^k. Note that, because of Remark 2, the above matrix is indeed a permutation matrix. Then, the coding matrices are defined as

Cj,i = λj,i Pi^(j−k−1),  for j = k + 1, . . . , n and i = 1, 2, . . . , k.

To understand the structure of the above permutation, consider an arbitrary column vector a = [a(1) a(2) . . . a((n − k)^k)]^T, and let m = ⟨x1, x2, x3, . . . , xk⟩ for 1 ≤ m ≤ (n − k)^k. Then, the mth component of Pi a is a(⟨x1, x2, . . . , xi−1, xi ⊕ 1, xi+1, . . . , xk⟩). Thus, we can write

dk+r+1(⟨x1, x2, . . . , xk⟩) = λk+r+1,1 a1(⟨x1 ⊕ r, x2, x3, . . . , xk⟩) + λk+r+1,2 a2(⟨x1, x2 ⊕ r, x3, . . . , xk⟩) + . . . + λk+r+1,k ak(⟨x1, x2, x3, . . . , xk ⊕ r⟩),

where r ∈ {0, 1, 2, . . . , n − k − 1}. This describes the coding matrices. Now, in case of failure of node l, the rows of the repair matrix Vl are chosen from the set {e(m) : φl(m) = 0}. Since φl(m) can take n − k values, this construction gives L/(n − k) = (n − k)^(k−1) rows for Vl, as required. Because of the construction, we have the following interference alignment relation for i ≠ l, j ∈ {k + 1, k + 2, . . . , n}:

rowspan(Vl Cj,i) = span({e(m) : φl(m) = 0}).

Further,

rowspan(Vl Cj,l) = span({e(m) : φl(m) = j − k − 1})

for j ∈ {k + 1, k + 2, . . . , n}, so that (8) is satisfied and the desired signal can be reconstructed once the aligned interference has been cancelled. All that remains to be shown is the MDS property; this is shown in Appendix I.
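To make the general construction and its repair procedure concrete, here is a self-contained Python sketch (our own illustration, not the authors' reference implementation). It assumes a prime field GF(q) so that plain modular arithmetic suffices, draws the λj,i at random, and verifies end to end that node 1 is repaired by downloading the (n − k)^(k−1) components with φ1(m) = 0 from every surviving node; it checks only the repair property, not the MDS property.

```python
# End-to-end repair of systematic node 1 in the (n, k) Permutation Code, over a prime field GF(q).
import random

def repair_check(n, k, q, seed=0):
    rng = random.Random(seed)
    b, L = n - k, (n - k) ** k

    def digits(m):                       # phi(m): base-b digits of m - 1, digit 1 most significant
        x, out = m - 1, []
        for _ in range(k):
            out.append(x % b)
            x //= b
        return list(reversed(out))

    def index(ds):                       # <r1, ..., rk>
        x = 0
        for r in ds:
            x = x * b + r
        return x + 1

    def shifted(m, i, r):                # <chi_i^r(m)>: add r (mod b) to digit i of m
        ds = digits(m)
        ds[i - 1] = (ds[i - 1] + r) % b
        return index(ds)

    lam = {(j, i): rng.randrange(1, q) for j in range(k + 1, n + 1) for i in range(1, k + 1)}
    a = {i: [rng.randrange(q) for _ in range(L)] for i in range(1, k + 1)}

    # Parity node j = k + r + 1 stores d_j(m) = sum_i lam_{j,i} a_i(<chi_i^r(m)>), per the text.
    d = {k + 1 + r: [sum(lam[(k + 1 + r, i)] * a[i][shifted(m, i, r) - 1] for i in range(1, k + 1)) % q
                     for m in range(1, L + 1)]
         for r in range(b)}

    # Repair node 1: download the rows m with phi_1(m) = 0 from every surviving node.
    rows = [m for m in range(1, L + 1) if digits(m)[0] == 0]
    recovered = [None] * L
    for r in range(b):
        j = k + 1 + r
        for m in rows:
            val = d[j][m - 1]
            for i in range(2, k + 1):
                # Cancel the aligned interference; each accessed a_i component also has phi_1 = 0,
                # i.e. it belongs to the half of node i that was downloaded.
                val = (val - lam[(j, i)] * a[i][shifted(m, i, r) - 1]) % q
            t = shifted(m, 1, r)         # the remaining term is lam_{j,1} a_1(<chi_1^r(m)>)
            recovered[t - 1] = (val * pow(lam[(j, 1)], q - 2, q)) % q
    assert recovered == a[1]
    print(f"({n},{k}) Permutation Code: node 1 repaired from {len(rows)} symbols per surviving node")

repair_check(5, 3, 11)
repair_check(6, 3, 11)   # n - k = 3: each node contributes (n-k)^(k-1) = 9 of its 27 symbols
```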

VI. EXPLICIT CONSTRUCTION OF PERMUTATION CODES FOR n − k ∈ {2, 3}

While, theoretically, any (n, k) MDS code could be used to build a distributed storage system, in practice the case of a small number of parity nodes, i.e., small values of n − k, is of special interest. In fact, a significant portion of the literature on the use of codes for storage systems is devoted to building codes with desirable properties for the cases of (n − k) ∈ {2, 3} (see, for example, [1]–[4]). While those references focused on constructing MDS codes with efficient encoding and decoding properties, here we study the construction of MDS codes for n − k ∈ {2, 3} with desirable repair properties. In the previous section, we provided random code constructions based on permutation matrices. In this section, we strengthen our constructions by providing explicit code constructions for the important case of n − k ∈ {2, 3}. Note that the codes constructed earlier were random constructions because the scalars λj,i were picked randomly from the field. Further, note that as long as the λj,i, j = k + 1, k + 2, . . . , n, i = 1, 2, . . . , k, are any set of non-zero scalars, the repair bandwidth for the failure of a single systematic node is (n − 1)/(n − k) units, as required. The randomness of the scalars λj,i was used in the previous section only to show the existence of codes which satisfy the MDS property. In this section, for the two cases n − k = 2 and n − k = 3, we choose these scalars explicitly

(i.e., not randomly) so that the MDS property is satisfied. For both cases, the scalars λj,i are chosen as

λj,i = λi^(j−k−1),    (13)

so that we have Cj,i = (λi Pi)^(j−k−1) for j = k + 1, . . . , n, i = 1, 2, . . . , k.

If n − k = 2, we choose q ≥ k + 1 and choose λ1, λ2, . . . , λk to be distinct non-zero elements of the field. With this choice of scalars, we show in Appendix II that the code satisfies the MDS property. For example, for the case of n = 5, k = 3 described previously, we can choose a field of size q = 4 and λ4,1 = λ4,2 = λ4,3 = 1, λ5,1 = 1, λ5,2 = 2, λ5,3 = 3. This choice of scalars λj,i ensures the MDS property, as shown in Appendix II.

For n − k = 3, we choose λ1, λ2, . . . , λk to be k non-zero elements of the field Fq satisfying

λi ≠ λj  and  λi + λj ≠ 0    (14)

for all i ≠ j, i, j ∈ {1, 2, . . . , k}. Note that elements λi satisfying the above conditions can be chosen by taking q ≥ 2k + 1 and ensuring that λi + λi′ = 0 implies λi′ ∉ {λ1, λ2, . . . , λk}. In Appendix II, we also show that the code described here for the case of n − k = 3 is an MDS code.

VII. CONCLUSION

In this paper, we provide a class of MDS codes with optimal repair properties, in terms of repair bandwidth, for a single failed systematic node. We show that our codes are optimal not only in terms of repair bandwidth, but also in terms of the amount of disk access during the recovery of a single failed node. We also provide explicit constructions over relatively small field sizes for the special cases of storage systems with 2 or 3 parity nodes. Since we effectively provide the first set of minimum storage regenerating (MSR) codes for arbitrary (n, k), this work can be viewed as a stepping stone towards the implementation of MDS codes in distributed storage systems.

From a theoretical perspective, since coding matrices in the storage setup are analogous to channel matrices in wireless channels, our codes may be viewed as a tool to create interference alignment toy examples for wireless channels. An interesting exercise in this direction would be to test whether our constructions could be modified, perhaps via an appropriate change of basis, to generate constructions satisfying the properties desired for ergodic interference alignment [19]. Such an exercise could lead to extensions of the powerful idea of ergodic interference alignment to more general contexts.

From the perspective of storage systems, there remain several unanswered questions. First, the existence of finite codes which can achieve more efficient repair of parity nodes, along with systematic nodes, remains open. Second, we assume that the new node connects to all d = n − 1 surviving nodes in the system; an interesting question is whether finite code constructions can be found which conduct efficient repair when the new node is restricted to connect to a subset of the surviving nodes. While asymptotic constructions satisfying the lower bounds have been found for both these problems, the existence of finite codes satisfying these properties remains open. Finally, the search for repair strategies for existing codes, which is analogous to the search for interference alignment beamforming vectors for fixed channel matrices in the context of interference channels, remains open. While iterative techniques exist in the wireless context [20], [21], they cannot be directly extended to the storage context because of the discrete nature of the optimization problem in the latter. Such algorithms, explored in the context of certain classes of codes in [22], [23], are an interesting area of future work.

APPENDIX I: MDS PROPERTY

We intend to show that the determinant of the matrix in (5) is a non-zero polynomial in Λ = {λj,i : j = k + 1, k + 2, . . . , n, i = 1, 2, . . . , k} for any distinct j1, j2, . . . , jk ∈ {1, 2, . . . , n}. If we show this, then each MDS constraint corresponds to showing that a polynomial pj1,j2,...,jk(Λ) is non-zero. Applying the Schwartz-Zippel Lemma to the product of these polynomials, ∏ pj1,j2,...,jk(Λ), taken over all choices of {j1, j2, . . . , jk}, automatically implies the existence of Λ so that the

MDS constraints are satisfied, in a sufficiently large field. Therefore, all that remains to be shown is that the determinant of (5) is a non-zero polynomial in Λ. We will show this by exhibiting at least one set of values of the variables Λ for which the determinant of (5) is non-zero.

To show this, we first assume, without loss of generality, that j1, j2, . . . , jk are in ascending order. Also, let j1, j2, . . . , jk−m ∈ {1, 2, . . . , k} and jk−m+1, jk−m+2, . . . , jk ∈ {k + 1, k + 2, . . . , n}. For simplicity we will assume that j1 = 1, j2 = 2, . . . , jk−m = k − m; the proof for any other set {j1, j2, . . . , jk−m} is almost identical, except for a difference in the indices used henceforth. Substituting the appropriate values of Cj,i, the matrix in (5) can be written as

[ I . . . 0 . . . 0 ]
[ . . . . . . . . . ]
[ 0 . . . I . . . 0 ]
[ λ_{j(k−m+1),1} P1^(s(k−m+1)) . . . λ_{j(k−m+1),k−m} P(k−m)^(s(k−m+1)) . . . λ_{j(k−m+1),k} Pk^(s(k−m+1)) ]
[ . . . . . . . . . ]
[ λ_{jk,1} P1^(sk) . . . λ_{jk,k−m} P(k−m)^(sk) . . . λ_{jk,k} Pk^(sk) ]    (15)

where si = ji − k − 1. Now, if we set λj,i = 0 for (j, i) ∉ {(jt, t) : t = k − m + 1, k − m + 2, . . . , k} and λj,i = 1 otherwise, then the above matrix becomes a block diagonal matrix. Therefore, its determinant evaluates to the product of the determinants of its diagonal blocks, i.e., ∏_{u=k−m+1}^{k} |Pu^(su)|, which is non-zero. This implies that the determinant in

(5) is a non-zero polynomial in Λ, as required. This completes the proof.

APPENDIX II: PROOF OF THE MDS PROPERTY FOR THE EXPLICIT CONSTRUCTIONS OF SECTION VI

We need to show Property 1. Before we show this property, we begin with the following lemma, which shows that the coding matrices of the Permutation Code commute.

Lemma 2: Pi^(m1) Pj^(m2) = Pj^(m2) Pi^(m1), where Pi is chosen as in (12).

Proof: In order to show this, we show that Pi^(m1) Pj^(m2) a = Pj^(m2) Pi^(m1) a for any (n − k)^k × 1 dimensional column vector a. Assuming, without loss of generality, that i < j, this can be seen by verifying that the ⟨r1, r2, . . . , rk⟩th element of both Pi^(m1) Pj^(m2) a and Pj^(m2) Pi^(m1) a is

a(⟨r1, r2, . . . , ri−1, ri ⊕ m1, ri+1, . . . , rj−1, rj ⊕ m2, rj+1, . . . , rk⟩).

Now, we proceed to show Property 1 for n − k ∈ {2, 3}. Without loss of generality, we assume that j1, j2, . . . , jk are in ascending order.

Case 1: n − k = 2. We divide this case into 2 scenarios. In the first scenario, j1, j2, . . . , jk−1 ∈ {1, 2, . . . , k} and jk ∈ {k + 1, k + 2}. Note that this corresponds to reconstructing the data from k − 1 systematic nodes and a single parity node. Substituting this in equation (15) of Appendix I, and expanding the determinant along the first (k − 1)L columns, the determinant is equal to |Cjk,i|, where i is the index of the source not covered by the systematic nodes. Therefore, the desired property is

equivalent to requiring that the matrix Cj,i = (λi Pi)^(j−k−1) be full rank for all j ∈ {k + 1, k + 2, . . . , n}, i = 1, 2, . . . , k; this scenario is hence trivial. Now, in the second scenario, consider the case where j1, j2, . . . , jk−2 ∈ {1, 2, . . . , k} and jk−1 = k + 1, jk = k + 2. This corresponds to the case where the original sources are reconstructed using k − 2 systematic nodes and both parity nodes. By substituting in (15) and expanding along the first (k − 2)L rows, the MDS property can be shown to be equivalent to showing that the matrix

[ I      I     ]
[ λi Pi  λj Pj ]

has full rank, where i and j are the indices of the two sources not covered by the systematic nodes. Now, note that the matrices Pi and Pj commute, by Lemma 2. On noting that the determinant of commuting block matrices can be evaluated using the element-wise determinant expansion over blocks [24], the determinant of the matrix above can be written as

|λj Pj − λi Pi| = λj^L |Pi| |Pj Pi^(−1) − λi λj^(−1) I|.

Note that the above expression is equal to 0 if and only if λi λj^(−1) is an eigenvalue of the permutation matrix Pj Pi^(−1). Since Pj Pi^(−1) is a permutation matrix, 1 is its only eigenvalue. Since the λi are chosen to be distinct, λi ≠ λj implies λi λj^(−1) ≠ 1, and hence the determinant shown above is non-zero and the matrix is full rank, as required.

Case 2: n − k = 3. We divide this case into 5 scenarios, as listed below.
1) j1, j2, . . . , jk−1 ∈ {1, 2, . . . , k} and jk ∈ {k + 1, k + 2, k + 3}.
2) j1, j2, . . . , jk−2 ∈ {1, 2, . . . , k} and jk−1 = k + 1, jk = k + 2.
3) j1, j2, . . . , jk−2 ∈ {1, 2, . . . , k} and jk−1 = k + 2, jk = k + 3.
4) j1, j2, . . . , jk−2 ∈ {1, 2, . . . , k} and jk−1 = k + 1, jk = k + 3.
5) j1, j2, . . . , jk−3 ∈ {1, 2, . . . , k} and jk−2 = k + 1, jk−1 = k + 2, jk = k + 3.
Property 1 can be proved to hold in the first two scenarios using arguments similar to Case 1. For the third scenario, again using arguments similar to Case 1, showing the MDS property is equivalent to showing that the matrix

[ λi Pi    λj Pj   ]
[ λi² Pi²  λj² Pj² ]

has full rank. The above matrix has full rank because it is equal to the product

[ I      I     ]   [ λi Pi  0     ]
[ λi Pi  λj Pj ] × [ 0      λj Pj ]

and both matrices in the above product have full rank. Now, for the fourth scenario, we need to show that the matrix

[ I        I       ]
[ λi² Pi²  λj² Pj² ]

has full rank. This can be seen by noting that the determinant of the above matrix evaluates to

|λj² Pj² − λi² Pi²| = λj^(2L) |Pi²| |Pj² Pi^(−2) − λi² λj^(−2) I|,

which is non-zero if λi² λj^(−2) ≠ 1. This is ensured because the conditions in (14) imply that λi² ≠ λj². Finally, we consider scenario 5, where we need to show that all the information can be recovered from k − 3 systematic nodes and all 3 parity nodes. For this, we need the matrix

[ I        I        I       ]
[ λi Pi    λj Pj    λl Pl   ]
[ λi² Pi²  λj² Pj²  λl² Pl² ]

to have full rank. Note that the above matrix has a block Vandermonde structure, where the blocks commute pairwise because of Lemma 2. This fact, combined with the fact that the determinant of commuting block matrices can be expanded in a manner similar to the element-wise determinant expansion, implies that the determinant of the above matrix is equal to

∏ |λu Pu − λv Pv|,

where the product is over the pairs {u, v} ⊂ {i, j, l}.

The determinant is non-zero since λu ≠ λv whenever u ≠ v. This completes the proof of the desired MDS property.
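As a complementary numerical check (our own, not part of the proof), the following Python sketch verifies Property 1 exhaustively for the explicit n − k = 2 construction of Section VI with (n, k) = (5, 3). For simplicity of the arithmetic it works over the prime field GF(7) with λ1, λ2, λ3 = 1, 2, 3, a choice that also satisfies λi ≠ ±λj; the text's own example instead uses GF(4).

```python
# Exhaustive MDS check of the explicit (5,3) Permutation Code over GF(7) with lambda = (1, 2, 3).
from itertools import combinations

q, n, k = 7, 5, 3
L = (n - k) ** k                                      # 8

def flip(m, pos):                                     # 0-based index m, flip binary digit `pos`
    return m ^ (1 << (k - 1 - pos))                   # digit 1 is the most significant bit

def perm_matrix(pos):                                 # P_{pos+1} as an L x L 0/1 matrix
    M = [[0] * L for _ in range(L)]
    for m in range(L):
        M[m][flip(m, pos)] = 1
    return M

I = [[1 if r == c else 0 for c in range(L)] for r in range(L)]
Z = [[0] * L for _ in range(L)]
lam = [1, 2, 3]
scale = lambda s, M: [[(s * x) % q for x in row] for row in M]

# Coding matrices: systematic nodes 1..3; parity node 4 uses C_{4,i} = I, node 5 uses C_{5,i} = lam_i P_i.
C = {i: [I if j == i else Z for j in range(1, k + 1)] for i in range(1, k + 1)}
C[4] = [I, I, I]
C[5] = [scale(lam[i], perm_matrix(i)) for i in range(k)]

def rank_mod_q(M):                                    # Gauss-Jordan elimination over GF(q)
    M = [row[:] for row in M]
    rank, n_rows, n_cols = 0, len(M), len(M[0])
    for c in range(n_cols):
        piv = next((r for r in range(rank, n_rows) if M[r][c]), None)
        if piv is None:
            continue
        M[rank], M[piv] = M[piv], M[rank]
        inv = pow(M[rank][c], q - 2, q)
        M[rank] = [(x * inv) % q for x in M[rank]]
        for r in range(n_rows):
            if r != rank and M[r][c]:
                f = M[r][c]
                M[r] = [(M[r][j] - f * M[rank][j]) % q for j in range(n_cols)]
        rank += 1
    return rank

for nodes in combinations(range(1, n + 1), k):        # every k-subset of nodes must determine the data
    stacked = []
    for node in nodes:
        for r in range(L):
            stacked.append([C[node][j][r][c] for j in range(k) for c in range(L)])
    assert rank_mod_q(stacked) == k * L, nodes
print("(5,3) explicit Permutation Code over GF(7): MDS property verified")
```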

REFERENCES

[1] M. Blaum, J. Brady, J. Bruck, and J. Menon, "Evenodd: an optimal scheme for tolerating double disk failures in RAID architectures," in Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 245–254, Apr. 1994.
[2] C. Huang and L. Xu, "STAR: an efficient coding scheme for correcting triple storage node failures," IEEE Transactions on Computers, vol. 57, pp. 889–901, July 2008.
[3] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, and S. Sankar, "Row-diagonal parity for double disk failure correction," in Proceedings of the 3rd USENIX Symposium on File and Storage Technologies (FAST), pp. 1–14, 2004.
[4] J. S. Plank, "The RAID-6 Liber8tion code," The International Journal of High Performance Computing Applications, vol. 23, pp. 242–251, August 2009.
[5] A. Dimakis, P. Godfrey, M. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," in IEEE INFOCOM, pp. 2000–2008, May 2007.
[6] Y. Wu and A. Dimakis, "Reducing repair traffic for erasure coding-based storage via interference alignment," in IEEE International Symposium on Information Theory, pp. 2276–2280, June-July 2009.
[7] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, "Explicit codes minimizing repair bandwidth for distributed storage," CoRR, vol. abs/0908.2984, 2009. http://arxiv.org/abs/0908.2984.
[8] C. Suh and K. Ramchandran, "Exact regeneration codes for distributed storage repair using interference alignment," CoRR, vol. abs/1001.0107, 2010. http://arxiv.org/abs/1001.0107.
[9] V. R. Cadambe, S. Jafar, and H. Maleki, "Distributed data storage with minimum storage regenerating codes - exact and functional repair are asymptotically equally efficient," CoRR, vol. abs/1004.4299, April 2010. http://arxiv.org/abs/1004.4299.
[10] C. Suh and K. Ramchandran, "On the existence of optimal exact-repair MDS codes for distributed storage," CoRR, vol. abs/1004.4663, April 2010. http://arxiv.org/abs/1004.4663.
[11] B. Gaston and J. Pujol, "Double circulant minimum storage regenerating codes," CoRR, vol. abs/1007.2401, 2010. http://arxiv.org/abs/1007.2401.
[12] M. Maddah-Ali, A. Motahari, and A. Khandani, "Communication over MIMO X channels: interference alignment, decomposition, and performance analysis," IEEE Transactions on Information Theory, pp. 3457–3470, 2008.
[13] S. Jafar and S. Shamai, "Degrees of freedom region for the MIMO X channel," IEEE Transactions on Information Theory, vol. 54, pp. 151–170, Jan. 2008.
[14] V. Cadambe and S. Jafar, "Interference alignment and the degrees of freedom of the K user interference channel," IEEE Transactions on Information Theory, vol. 54, pp. 3425–3441, Aug. 2008.
[15] K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction," CoRR, vol. abs/1005.4178, 2010. http://arxiv.org/abs/1005.4178.
[16] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, "Distributed storage codes with repair-by-transfer and non-achievability of interior points on the storage-bandwidth tradeoff," CoRR, vol. abs/1011.2361, 2010. http://arxiv.org/abs/1011.2361.
[17] S. Yekhanin, "Locally decodable codes," Now Publishers, June 2010.
[18] D. Cullina, A. Dimakis, and T. Ho, "Searching for minimum storage regenerating codes," Proceedings of the 47th Annual Allerton Conference on Communication, Control and Computation, Sep. 2009. http://arxiv.org/abs/0910.2245.
[19] B. Nazer, M. Gastpar, S. A. Jafar, and S. Vishwanath, "Ergodic interference alignment," June 2009.
[20] K. Gomadam, V. Cadambe, and S. Jafar, "Approaching the capacity of wireless networks through distributed interference alignment," submitted to IEEE Globecom 2008; preprint available through the authors' website, March 2008.
[21] S. Peters and R. Heath, "Interference alignment via alternating minimization," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2445–2448, April 2009.
[22] Z. Wang, A. G. Dimakis, and J. Bruck, "Rebuilding for array codes in distributed storage systems," ACTEMT: Workshop on the Application of Communication Theory to Emerging Memory Technologies, December 2010. http://arxiv.org/abs/1009.3291.
[23] L. Xiang, Y. Xu, J. C. Lui, and Q. Chang, "Optimal recovery of single disk failure in RDP code storage systems," in Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '10), New York, NY, USA, pp. 119–130, ACM, 2010. http://doi.acm.org/10.1145/1811039.1811054.
[24] I. Kovacs, D. S. Silver, and S. G. Williams, "Determinants of commuting-block matrices," The American Mathematical Monthly, vol. 106, no. 10, pp. 950–952, 1999. http://www.jstor.org/stable/2589750.