Local Codes With Cooperative Repair in Distributed Storage of Cyber-Physical-Social Systems

By integrating cyber, physical, and social spaces, cyber-physical-social systems (CPSS) bring greater convenience to humans. For practical applications and user convenience, it is essential that the Big Data produced in CPSS be stored in the distributed storage systems of CPSS. In this paper, we study the fault tolerance scheme for distributed storage systems of CPSS, and propose a framework that can recover multiple failed nodes simultaneously. Considering the reliability of storage nodes in distributed storage systems, research on locally repairable codes has mostly focused on repairing failed nodes within each repair group. However, when entire repair groups have failed, existing locally repairable codes cannot repair more than one failed group. In this paper, local codes with cooperative repair that can recover more than one failed group are proposed. Specifically, the proposed local codes are constructed based on minimum storage regenerating (MSR) codes, and have an interleaving structure among the local codes, so that the parity symbols of any local code can be generated from the MSR codes in its two adjacent local codes. Taking advantage of this property, more than one failed local group can be repaired cooperatively by their adjacent local groups with lower repair locality. Furthermore, the key parameters of local codes with cooperative repair are derived. Theoretical analysis and simulation results show that, compared with previous codes with local regeneration, our codes have higher bandwidth overhead when repairing failed nodes, but offer advantages in storage overhead and repair locality for the repair of either a single failed node or one failed local group. Moreover, for a single failed local group, local codes with cooperative repair achieve almost the same tradeoff curve of storage overhead and bandwidth overhead as MSR-local codes and minimum bandwidth regenerating local (MBR-local) codes.


I. INTRODUCTION
Since the traditional cloud storage model runs in a centralized manner, a single point of failure may lead to the collapse of the system. With the development of cyber-physical-social systems (CPSS), the distributed storage mode has entered the public view. The distributed storage approach can solve the single-point-of-failure problem of traditional cloud storage systems and enjoys a number of advantages over centralized storage, such as low price and high throughput. CPSS have also seen significant adoption in the past few years and show promise for designing applications without any centralized reliance on third parties. Accordingly, the reliable storage of distributed storage systems is attracting more and more attention. Wei et al. discussed secure data storage and recovery in industrial network environments of CPSS in [1]-[3].
The associate editor coordinating the review of this manuscript and approving it for publication was Xiaokang Wang.
In large distributed storage systems (DSSs), node failures and the resulting data loss are inevitable. To retain high availability of DSSs, Dimakis et al. introduced regenerating codes based on network coding [4], which can reduce repair bandwidth compared with a combination of replication and erasure codes. Rashmi et al. presented explicit constructions of regenerating codes that reach the two extreme points [5], termed the minimum storage regenerating (MSR) and minimum bandwidth regenerating (MBR) points, which achieve the minimum storage cost and the minimum bandwidth cost during the node repair process, respectively. Furthermore, Ernvall gave constructions of regenerating codes between the MSR point and the MBR point [6], [7].
Apart from storage and repair bandwidth overhead, repair locality, the number of nodes contacted by the replacement node in the repair process [8]-[10], is an important cost metric of data repair. The repair locality relates to the disk input/output (I/O) overhead, which has been the main performance bottleneck when repairing failed nodes. To reduce the disk I/O overhead, Papailiopoulos et al. proposed simple regenerating codes that employ simple XORs over maximum distance separable (MDS) coded packets to perform exact repair [11]. Moreover, Papailiopoulos and Dimakis presented optimal locally repairable codes (LRCs) based on simple combinations of Reed-Solomon (RS) codes that can achieve arbitrarily high data rates and better repair locality [12]. Based on maximum rank distance (MRD) Gabidulin codes, Rawat et al. constructed optimal LRCs with a two-layer encoding structure [13], and derived the upper bound on the amount of data stored in DSSs. However, considering the properties of Gabidulin codes [14], the complexity of the optimal LRCs increases exponentially with the number of nodes in DSSs.
Kamath et al. focused on the construction of optimal codes with local regeneration [15], which combine the advantages of both LRCs and regenerating codes, aiming to simplify node repair and reduce encoding complexity. The constructed optimal codes with local regeneration, such as MSR-local codes and MBR-local codes, contain multiple local codes and global parities, in which the local codes are either MSR codes or MBR codes. Also, Rawat et al. gave another construction of MSR-LRC [16], in which the file is encoded using a Gabidulin code, the codeword of the Gabidulin code is partitioned into local groups, and each of these local groups is encoded using an MSR code.
When one entire repair group has failed, the local code in the failed group needs to be recovered as a whole. MSR-local codes and MBR-local codes proposed by Kamath et al. [15] have the ability to repair failed local codes, even though their initial purpose was to simplify node repair. Nevertheless, based on their construction, MSR-local codes and MBR-local codes can repair only one failed local code, and are incapable of repairing two or more failed local codes at the same time. Furthermore, MSR-local codes and MBR-local codes must collect all the remaining local codes and the global parities to repair a single failed local code, which leads to a high repair locality.
Inspired by the distributed parity of RAID 5 [17], the global parities of MSR-local codes may also be broken down into multiple distributed parities to avoid the above limitations of MSR-local codes in repairing failed local codes. In this paper, we present an explicit construction of local codes with cooperative repair (LCCR). In contrast to MSR-local codes or MBR-local codes, each local code of LCCR includes an MSR code and a distributed parity. In particular, the MSR code in LCCR is systematic, including information symbols and parity symbols. Furthermore, the distributed parity part of each local code in LCCR can be generated from the MSR codes in its two adjacent local codes, which can be regarded as an interleaving structure among the local codes. Based on the structure of the proposed LCCR, more than one failed local group can be recovered cooperatively by their adjacent groups with lower repair locality. Additionally, the key parameters of LCCR are derived. Theoretical and numerical analyses show that the proposed LCCR improve repair locality for node failures and local group failures, and achieve almost the same tradeoff curve of storage overhead and bandwidth overhead as MSR-local codes and MBR-local codes for a single failed local group.
This paper is organized as follows. In Section II, we review relevant background. The explicit construction of LCCR is presented in Section III, followed by the derivation of the minimum distance of LCCR. Section IV provides the performance analyses, including the repair of a single failed node, one failed local group, and multiple failed local groups. Section V concludes the paper.

A. MSR CODES AND MBR CODES
In a DSS, a message file of M symbols over GF(q) is dispersed across n active storage nodes, each of which has a storage capacity of α code symbols. Assuming that a newcomer receives the same amount of information from each of the existing nodes, any k storage nodes can recover the original file. When a storage node fails, a replacement node collects β symbols from each of any d surviving nodes, and hence the total repair bandwidth is γ = dβ.
Since the entire original file can be reconstructed from any k storage nodes, any node in the DSS can be recovered. In this way, a total bandwidth of kα ≥ M symbols is required to repair any failed node with a capacity of α symbols. According to the results in [4], by connecting to d surviving nodes, a smaller bandwidth of γ = dβ suffices to recover the failed node. There exists an optimal tradeoff between the storage overhead per node α and the repair bandwidth γ, and the codes that attain this optimal tradeoff are called regenerating codes. The codes that attain the minimum storage overhead are termed minimum-storage regenerating (MSR) codes, with
$$(\alpha_{\text{MSR}}, \gamma_{\text{MSR}}) = \left(\frac{M}{k}, \frac{Md}{k(d-k+1)}\right).$$
Similarly, the codes that achieve the minimum repair bandwidth are referred to as minimum-bandwidth regenerating (MBR) codes, with
$$(\alpha_{\text{MBR}}, \gamma_{\text{MBR}}) = \left(\frac{2Md}{k(2d-k+1)}, \frac{2Md}{k(2d-k+1)}\right).$$
The repair process above can be either functional repair or exact repair [6], [7]. For functional repair, the nodes may change over time. If a node $v_i^{\text{old}}$ is lost and we get a new node $v_i^{\text{new}}$ in the repair process, then we may have $v_i^{\text{old}} \neq v_i^{\text{new}}$. On the other hand, exact repair means that the new node $v_i^{\text{new}}$ is always the same as the old one $v_i^{\text{old}}$. Exact repair can obviate additional communication overhead during the repair process, and also avoids retuning the reconstruction and repair algorithms. In this paper, exact repair is considered in order to maintain the local codes in systematic form during the repair operation.
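As a quick numerical illustration of the two extreme points, the following minimal sketch evaluates the per-node storage α and the repair bandwidth γ at the MSR and MBR points using the standard closed-form expressions from [4]. The function names and the example values M = 12, k = 3, d = 5 are assumptions chosen for illustration only; they are not parameters used in this paper.

```python
from fractions import Fraction


def msr_point(M, k, d):
    """Return (alpha, gamma) at the minimum-storage regenerating point."""
    alpha = Fraction(M, k)
    gamma = Fraction(M * d, k * (d - k + 1))
    return alpha, gamma


def mbr_point(M, k, d):
    """Return (alpha, gamma) at the minimum-bandwidth regenerating point."""
    gamma = Fraction(2 * M * d, k * (2 * d - k + 1))
    # At the MBR point a node stores exactly what it downloads during repair.
    return gamma, gamma


if __name__ == "__main__":
    M, k, d = 12, 3, 5  # illustrative values only
    print("MSR (alpha, gamma):", msr_point(M, k, d))
    print("MBR (alpha, gamma):", mbr_point(M, k, d))
    print("naive repair (download the whole file):", M)
```

With these example values the MSR point stores less per node, the MBR point downloads less during repair, and both download less than the M symbols needed to rebuild the whole file.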

B. MSR-LOCAL CODES AND MBR-LOCAL CODES
By combining the advantages of LRCs and regenerating codes, Kamath et al. proposed codes with local regeneration [15], which carry locality over a vector alphabet. The constituent local codes themselves are regenerating codes, such as MSR codes or MBR codes. For example, the construction of MSR-local codes is shown in Fig. 1.
The parent MSR code $(m^t, m^t P_1, m^t P_2)$ is an $((n_L + \Delta, r, d), (\alpha, \beta))$ MSR code with $n_L = r + \delta - 1$ and $d \le r + \delta - 2$, and its generator matrix is $G_0 = [G_L \mid Q]$. Puncturing the parent MSR code to the first $n_L$ symbols, we obtain an $((n_L, r, d), (\alpha, \beta))$ MSR code $(m^t, m^t P_1)$. The MSR-local codes $C_{\text{MSR-local}}$ consist of the MSR codes $(m_i^t, m_i^t P_1)$ as local codes $C_i$ ($1 \le i \le m$), and $\left(\sum_{i=1}^{m} m_i\right)^t P_2$ as global parities. The corresponding generator matrix $G_{\text{MSR-local}}$ is given by
$$G_{\text{MSR-local}} = \begin{bmatrix} G_L & & & & Q \\ & G_L & & & Q \\ & & \ddots & & \vdots \\ & & & G_L & Q \end{bmatrix}.$$
Similarly, the MBR-local codes can be realized by the same method as MSR-local codes. The MBR-local codes $C_{\text{MBR-local}}$ proposed in [15] are also composed of m support-disjoint MBR codes and global parities. Specifically, each MBR code in MBR-local codes is a repair-by-transfer (RBT) $((n_L, r, d), (\alpha, \beta), K_L)$ MBR code [18]. Thus the desired MBR-local codes $C_{\text{MBR-local}}$ have length $n = mn_L + \Delta$ and dimension $K = mK_L$.

III. CONSTRUCTION OF LOCAL CODES WITH COOPERATIVE REPAIR
For MSR-local and MBR-local codes, all the remaining local codes and the global parities must be collected to repair one failed local code, which results in high repair locality. RAID 4 consists of block-level striping with a dedicated parity disk, whereas RAID 5 uses distributed parity. Unlike in RAID 4, the parity information of RAID 5 is distributed among the drives, which spreads the load of a dedicated parity disk evenly across all RAID members. Inspired by the distributed parity of RAID 5, the global parities of MSR-local codes can also be broken down into multiple distributed parities to avoid the limitations in repairing failed local codes. Thus, we construct local codes with cooperative repair (LCCR), in which each local code includes an MSR code and a distributed parity.
Based on the MSR code above, an $[n, K, d_{\min}, \alpha]$ LCCR $C$ is constructed with $n = m(n_L + \Delta)$ and $K = mK_L$, in which the parameters δ and Δ are chosen such that $\Delta = \delta - 1$. LCCR $C$ has m local codes $C_1, C_2, C_3, \cdots, C_m$, each of which includes an MSR code and a distributed parity. To be specific, $C_i = (m_i, m_i P_i, m_{i-1} P_{i+1} + m_{i+1} P_{i-1})$, which can be generated by a submatrix of the generator matrix $G$ of the desired LCCR $C$. The local code $C_1$ (or $C_m$) can be generated by the submatrix composed of the first two columns of the generator matrix $G$ (or the last two columns). According to the matrix construction above, the distributed parity symbols of each local code can be created by the MSR codes in its two adjacent local codes, which can be regarded as a kind of mutual interleaving structure among the local codes.
The code construction of LCCR C is illustrated in Fig. 2.
The punctured code $(m_i, m_i P_i)$ is also an MSR code, obtained by deleting the parities $m_i P_{i+2}$ and $m_i P_{i-2}$ of the parent MSR code $C_i$ [15]. Based on the punctured MSR code $C_i$, the constructed LCCR $C$ contains m local codes $C_1, C_2, C_3, \cdots, C_m$. Concretely, a local code $C_i$ includes an MSR code $(m_i, m_i P_i)$ and a distributed parity $m_{i-1} P_{i+1} + m_{i+1} P_{i-1}$, which can be formed from the MSR codes $(m_{i-1}, m_{i-1} P_{i-1})$ and $(m_{i+1}, m_{i+1} P_{i+1})$ in its two adjacent local codes. A local code $C_i$ has two adjacent local codes, $C_{i-1}$ and $C_{i+1}$. The theorem below identifies the parameters of the LCCR so constructed and derives the minimum distance $d_{\min}$ of LCCR.
Theorem 1: Consider LCCR $C$ constructed in Construction 1, in which the parameters δ and Δ are chosen such that $\Delta = \delta - 1$. Then LCCR $C$ with (r, δ) information locality has length $n = m(n_L + \Delta)$ and dimension $K = mK_L$. The minimum distance of $C$ is given by $d_{\min} = \delta + 2\Delta$.
Proof: For LCCR $C$ in Fig. 2, it suffices to show that any non-zero codeword $C$ has Hamming weight $wt(C) \ge \delta + 2\Delta$. First of all, note that if $C$ has non-zero components belonging to one or more local codes, then $wt(C) \ge \delta + 2\Delta$, since all MSR codes in local codes have minimum distance δ and all distributed parities have minimum distance Δ. Next, consider the complementary case where the non-zero components of $C$ are restricted to one MSR code and the two relevant distributed parities of local codes, such as the MSR code $(m_1, m_1 P_1)$ and the two distributed parities $m_1 P_3$ in local code $C_2$ and $m_1 P_{m-1}$ in $C_m$. When the all-zero code symbols are deleted from each codeword, the resultant punctured codeword $C_{\text{punc}}$ lies in the row space of the matrix obtained by reducing the generator matrix $G$ accordingly. As the information vector remains unchanged, $C_{\text{punc}}$ is a codeword of the parent MSR code of $C_1$, with Hamming weight $wt(C_{\text{punc}}) = \delta + 2\Delta$. By the analysis above, the minimum distance $d_{\min}$ of LCCR is $\delta + 2\Delta$.

IV. PERFORMANCE COMPARISON
In this section, we choose a representative set of codes, including LCCR, MSR-local codes and MBR-local codes, to analyze their performances in storage overhead, repair bandwidth overhead, and repair locality. Since the bandwidth overhead and the repair locality depend on the failures of local groups or nodes, the cases of a single node failure, one local group failure, and multiple local group failures are discussed separately.

A. STORAGE OVERHEAD
In this paper, we adopt the definition of storage overhead $n\alpha/K$ given in [15], which is the ratio of the total number of storage units required to store the encoded symbols to the number of message symbols contained in the original file. Substituting the code lengths and dimensions given above, the storage overhead of LCCR is $\frac{m(n_L + \Delta)\alpha}{mK_L} = \frac{(n_L + \Delta)\alpha}{K_L}$, and the storage overheads of MSR-local codes and MBR-local codes both take the form $\frac{(mn_L + \Delta)\alpha}{mK_L}$, with α and $K_L$ those of the respective local codes. As MSR-local codes and MBR-local codes have similar code structures, the parameter m takes the same value for MSR-local codes and MBR-local codes when n and $d_{\min}$ are fixed. Similarly, the parameters r, δ, Δ in MSR-local codes and MBR-local codes also have the same values. Thus, we can compare the storage overhead of MSR-local codes with that of MBR-local codes directly. For MBR-local codes, the dimension is $K_L = r\alpha - \binom{r}{2}$, and thus r ≥ 2. Meanwhile, with δ ≥ 3, it can be deduced that the storage overhead of MBR-local codes is lower bounded by that of MSR-local codes, and the factor η reflects the extra storage overhead of MBR-local codes compared with MSR-local codes.
With the same code length n and minimum distance $d_{\min}$, the values of the parameters m, r, δ, Δ obtained in LCCR will differ from those of MSR-local codes and MBR-local codes, since its code length is $n = m(n_L + \Delta)$ and its minimum distance is $d_{\min} = \delta + 2\Delta$. Thus, we cannot directly compare the storage overhead of LCCR with that of MSR-local codes or MBR-local codes according to the formula $n\alpha/K$. In the next subsection, the storage overheads of LCCR, MSR-local codes, and MBR-local codes are compared by numerical analysis for n = 120 and $d_{\min}$ = 16.
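To make the parameter dependence concrete, the sketch below enumerates the LCCR parameter sets (m, r, δ, Δ) that are compatible with a given length n and minimum distance $d_{\min}$, using only the relations stated in the text: $n = m(n_L + \Delta)$, $n_L = r + \delta - 1$, $d_{\min} = \delta + 2\Delta$, and $\Delta = \delta - 1$. The storage overhead additionally assumes $K_L = r\alpha$ for the MSR local code. The function names, the bound `m_min`, and the requirement r ≥ 1 are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction


def lccr_parameter_sets(n, d_min, m_min=3):
    """List (m, r, delta, big_delta) satisfying the LCCR length and distance relations."""
    # With big_delta = delta - 1, d_min = delta + 2 * big_delta = 3 * delta - 2.
    if (d_min + 2) % 3:
        return []
    delta = (d_min + 2) // 3
    big_delta = delta - 1
    results = []
    for m in range(m_min, n + 1):
        if n % m:
            continue  # n must be a multiple of the number of local codes
        r = n // m - (delta - 1) - big_delta
        if r >= 1:
            results.append((m, r, delta, big_delta))
    return results


def lccr_storage_overhead(r, delta, big_delta):
    """Storage overhead n*alpha/K with n = m(n_L + Delta) and K = m*r*alpha (assumed K_L = r*alpha)."""
    n_local = r + delta - 1
    return Fraction(n_local + big_delta, r)


if __name__ == "__main__":
    for m, r, delta, big_delta in lccr_parameter_sets(120, 16):
        overhead = lccr_storage_overhead(r, delta, big_delta)
        print(f"m={m:2d} r={r:2d} delta={delta} Delta={big_delta} overhead={float(overhead):.3f}")
```

For n = 120 and $d_{\min}$ = 16 these relations force δ = 6 and Δ = 5, so each admissible design is fixed by the pair (m, r) alone.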

B. REPAIR OF A SINGLE FAILED NODE
To maintain the completeness of the codes, failed nodes must be repaired even when they lie in the parity part, whether in the global parities of MSR-local codes or MBR-local codes, or in the distributed parity of LCCR. We discuss only the case of a single node failure; the case of multiple node failures can be regarded as a generalization of it.
Although the proposed LCCR repairs failed local codes as a whole, LCCR, being constructed from MSR codes, can also repair failed nodes within local groups. When the failed nodes of one local group are in the MSR code part, we can adopt the same repair method as for MSR codes. If there are several failed nodes in the distributed parity of one local code, we can repair them by collecting the MSR codes in its two adjacent local codes. Specifically for LCCR, take as an example a single failed node in the distributed parity $m_{i-1} P_{i+1} + m_{i+1} P_{i-1}$ of local code $C_i$, as shown in Fig. 3. To repair the single failed node, local group i should collect symbols $m_{i-1}$ and $m_{i-1} P_{i-1}$ from local group i − 1, and symbols $m_{i+1}$ and $m_{i+1} P_{i+1}$ from local group i + 1. Then, by simple matrix operations, the distributed parity $m_{i-1} P_{i+1} + m_{i+1} P_{i-1}$ can be obtained, and the single failed node in the distributed parity can be repaired.
Next, we discuss the repair locality of LCCR, MSR-local codes and MBR-local codes in the case of a single node failure. Since, by the construction of LCCR, the failed node may be located either in the MSR code of a local code or in the distributed parity, we should consider the two cases together.
Lemma 1: The locality to repair one single failed node located in the MSR code of LCCR is $n_L - 1$.
Lemma 2: The locality to repair one single failed node located in the distributed parity of LCCR is $2n_L$.
Theorem 2: The locality of LCCR to repair one single failed node is $(n_L - 1)\cdot\frac{n_L}{n_L + \Delta} + 2n_L\cdot\frac{\Delta}{n_L + \Delta}$.
Since the local codes of MSR-local codes are MSR codes, the repair locality of MSR-local codes is the same as that of MSR codes when repairing one single failed node in local codes, which equals $n_L - 1 = r + \delta - 2$. Meanwhile, if the single failed node is located in the global parities, the global parities of MSR-local codes should be repaired as a whole, which needs mr nodes in total. The repair locality of MSR-local codes is $(n_L - 1)\cdot\frac{mn_L}{mn_L + \Delta} + mr\cdot\frac{\Delta}{mn_L + \Delta} = (n_L - 1)\cdot\frac{n_L}{n_L + \Delta/m} + r\cdot\frac{\Delta}{n_L + \Delta/m}$.
Adopting the same method, we can obtain the repair locality of MBR-local codes, which equals that of MSR-local codes as given by the formula above.
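The two average-locality expressions above can be evaluated directly. The sketch below is a minimal illustration: each function weights the locality of the two failure positions (local code versus parity part) by the fraction of nodes in that position. The function names and the sample parameter values are assumptions for illustration; the two sample evaluations do not share the same n and $d_{\min}$, so they are not a comparison between the codes.

```python
from fractions import Fraction


def lccr_single_node_locality(r, delta, big_delta):
    """Average locality of LCCR: MSR-part nodes cost n_L - 1, distributed-parity nodes cost 2 * n_L."""
    n_local = r + delta - 1
    total = n_local + big_delta  # nodes per local code
    return (Fraction((n_local - 1) * n_local, total)
            + Fraction(2 * n_local * big_delta, total))


def msr_local_single_node_locality(m, r, delta, big_delta):
    """Average locality of MSR-local codes: local nodes cost n_L - 1, global-parity nodes cost m * r."""
    n_local = r + delta - 1
    total = m * n_local + big_delta  # all nodes of the code
    return (Fraction((n_local - 1) * m * n_local, total)
            + Fraction(m * r * big_delta, total))


if __name__ == "__main__":
    # Illustrative values only.
    print(lccr_single_node_locality(r=10, delta=6, big_delta=5))
    print(msr_local_single_node_locality(m=6, r=10, delta=6, big_delta=5))
```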
Using numerical analysis, we can compare the locality of LCCR with that of MSR-local codes or MBR-local codes. Concretely, we choose common length n = 120 and common minimum distance $d_{\min}$ = 16. For convenient comparison, we set the parameter m ≥ 3, since there are at least three local codes in LCCR. Fig. 4 illustrates the repair locality versus storage overhead for LCCR, MSR-local codes and MBR-local codes. From Fig. 4, MBR-local codes have the same repair locality as MSR-local codes, as they have the same values for the parameters m, r, δ, Δ, which is consistent with the theoretical analysis. The numerical analysis shows that LCCR has the smallest repair locality under the same storage consumption.
Furthermore, we analyze the bandwidth overhead to repair one single failed node. The normalized bandwidth overhead can be calculated by $n\omega/K$ [15], where ω denotes the average repair bandwidth for repairing a single failed node.
Theorem 3: The bandwidth overhead of LCCR to repair one single failed node is $B_{\text{LCCR}} = \frac{n\,\omega_{\text{LCCR}}}{K}$, where $\omega_{\text{LCCR}}$ is the average repair bandwidth given in the proof.
Proof: According to the repair process of a single failed node in LCCR, the failed node may be located in the MSR code of local codes, or in the distributed parity. We can calculate $\omega_{\text{LCCR}}$ by
$$\omega_{\text{LCCR}} = \omega_1\cdot\frac{n_L}{n_L + \Delta} + \omega_2\cdot\frac{\Delta}{n_L + \Delta},$$
where $\omega_1 = \frac{n_L\alpha}{r}\cdot\frac{n_L - 1}{n_L - r}$ denotes the average repair bandwidth for repairing a single failed node located in the MSR code of local codes, and $\omega_2 = 2n_L\alpha$ denotes the repair bandwidth required for repairing one failed node located in the distributed parity. The repair bandwidth needed is $2n_L\alpha$ whether there are one or more failed nodes in the distributed parity, since the distributed parity should always be repaired as a whole. According to the analysis above, the bandwidth overhead of LCCR is $B_{\text{LCCR}} = \frac{n\,\omega_{\text{LCCR}}}{K}$.
For MSR-local codes and MBR-local codes, all the information symbols of the m local codes should be collected together to repair one single failed node located in the global parities, which further increases the repair bandwidth. The repair bandwidth overhead of MSR-local codes and MBR-local codes can also be calculated by $n\omega/K$ [15].
38626 VOLUME 8, 2020
Lemma 3: For MSR-local codes, the average repair bandwidth for repairing a single failed node located in local codes is $\omega_1 = \frac{n_L\alpha}{r}\cdot\frac{n_L - 1}{n_L - r}$, and the average repair bandwidth for repairing a failed node located in global parities is $\omega_2 = mK_L = mr\alpha$. The average repair bandwidth is $\omega_{\text{MSR-local}} = \omega_1\cdot\frac{mn_L}{mn_L + \Delta} + \omega_2\cdot\frac{\Delta}{mn_L + \Delta}$, and the bandwidth overhead of MSR-local codes to repair one single failed node is $B_{\text{MSR-local}} = \frac{n\,\omega_{\text{MSR-local}}}{K}$.

Lemma 4: In MBR-local codes, each MBR code adopted is a repair-by-transfer (RBT) $((n_L, r, d), (\alpha, \beta), K_L)$ MBR code [18]. The average repair bandwidth for repairing a single failed node located in local codes is $\omega_1 = n_L - 1$, and the average repair bandwidth for repairing a failed node located in global parities is $\omega_2 = mK_L = m\left(r\alpha - \binom{r}{2}\right)$. The bandwidth overhead of MBR-local codes for repairing one single failed node is $B_{\text{MBR-local}} = \frac{n\,\omega_{\text{MBR-local}}}{K}$, where $\omega_{\text{MBR-local}}$ is the average of $\omega_1$ and $\omega_2$ weighted by the fractions of nodes in the local codes and in the global parities.
A summary of repair locality and bandwidth overhead for LCCR, MSR-local codes and MBR-local codes in the case of a single node failure is presented in Table 1. Since MSR-local codes and MBR-local codes have similar code structures, we can compare their bandwidth overheads directly, obtaining $B_{\text{MBR-local}} < B_{\text{MSR-local}}$. Similarly, we adopt numerical analysis to compare the bandwidth overhead of LCCR with that of MSR-local codes or MBR-local codes. For n = 120, $d_{\min}$ = 16 and m ≥ 3, the repair bandwidth versus storage overhead of LCCR, MSR-local codes and MBR-local codes is shown in Fig. 5.
Since MSR-local codes and MBR-local codes take the same values for the parameters m, r, δ, Δ when n = 120 and $d_{\min}$ = 16 are fixed, we obtain the same number of codes for MSR-local codes and MBR-local codes. In Fig. 5 in particular, MSR-local codes and MBR-local codes have similar distributions whose points correspond to each other: an MBR-local code and an MSR-local code with the same values of m, r, δ, Δ occupy corresponding positions in their respective regions of the plot. The storage overhead of MBR-local codes is larger than that of MSR-local codes, while their repair bandwidth is smaller, in accordance with the theoretical analysis. From Fig. 5, the bandwidth overhead of LCCR is larger than that of MBR-local codes and MSR-local codes on the whole, which results from the fact that there is a distributed parity in each local code of LCCR. The average bandwidth overhead to repair a failed node located in the distributed parities of LCCR is larger than that for a failed node located in the global parities of MSR-local codes and MBR-local codes.
In addition, we consider repairing a failed node located only in the MSR code part of LCCR, or in the local codes of MSR-local codes or MBR-local codes. As summarized in Table 2, the corresponding repair locality and bandwidth overhead are illustrated in Fig. 6 and Fig. 7 for the case that the failed node is not located in the parity part, i.e., the global parities of MSR-local codes or MBR-local codes, or the distributed parity of LCCR. From Fig. 6, LCCR also has the smallest repair locality compared with MSR-local codes and MBR-local codes. The bandwidth overheads of LCCR, MSR-local codes and MBR-local codes in Fig. 7 are smaller than those in Fig. 5, which coincides with the theoretical analysis above. Moreover, in Fig. 7, LCCR have almost the same bandwidth and storage overheads as MSR-local codes, except at the point with higher storage overhead.

C. REPAIR OF ONE FAILED LOCAL GROUP
Since the minimum distance $d_{\text{min\_local}}$ of the MSR code in one local code of LCCR is δ, the failed nodes cannot be repaired by other available nodes of the same local group when the number of failed nodes in one local code exceeds δ − 1. Thus local groups in which there exist more than δ − 1 failed nodes are considered failed local groups. For LCCR, one failed local group can be repaired cooperatively by its four adjacent local groups. Fig. 8 illustrates the concrete repair procedure for one failed local group corresponding to local code $C_i$ ($1 \le i \le m$). In Fig. 8, the indices of the adjacent groups are shown without the cyclic shift for simplicity. Local code $C_i$ can be recovered by local codes $C_{i-2}$, $C_{i-1}$, $C_{i+1}$ and $C_{i+2}$. Local group i − 1 collects data $m_{i-2}$ and $m_{i-2} P_{i-2}$ of local code $C_{i-2}$ in local group i − 2, and meanwhile local group i + 1 collects data $m_{i+2}$ and $m_{i+2} P_{i+2}$ of local code $C_{i+2}$ in local group i + 2. By collecting $m_{i-2} P_i + m_i P_{i-2}$ of local code $C_{i-1}$ and $m_i P_{i+2} + m_{i+2} P_i$ of local code $C_{i+1}$, local group i can recover data $m_i P_i$ by matrix operations, and the information data $m_i$ by using the generator matrix $G_i = [I \mid P_i]$ of the punctured MSR code $C_i$, since δ − 1 ≥ r. Furthermore, by collecting data $m_{i-1}$ and $m_{i-1} P_{i-1}$ from local group i − 1, and $m_{i+1}$ and $m_{i+1} P_{i+1}$ from local group i + 1, local group i obtains the distributed parity $m_{i-1} P_{i+1} + m_{i+1} P_{i-1}$ of local code $C_i$, also by simple matrix operations. At this point, both the MSR code part and the distributed parity part of local code $C_i$ have been obtained, and the repair of local group i is complete.
Definition 1 (Locality for Repairing Failed Local Groups): Similar to repairing failed nodes, the number of local groups that participate in the repair of failed local groups is defined as the repair locality for repairing failed local groups. In particular, for MSR-local codes and MBR-local codes, when the global parities are involved in the repair of failed local groups, the global parities should be regarded as one whole local group during the repair.
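The cooperative repair of one failed local group described above can be mimicked with a small scalar toy model. The sketch below is not the MSR-based construction of this paper: it works over the prime field GF(257), uses hypothetical Vandermonde blocks as stand-ins for the matrices $P_j$, treats every node as a single symbol, and recovers $m_i$ by solving the linear system formed by $m_i P_{i-2}$ and $m_i P_{i+2}$ rather than through MSR repair. All names (`parity_matrix`, `encode`, `repair_group`, `solve_left`) and the sizes m = 6, r = 3 are assumptions made for illustration. It only shows that the four neighbouring groups i − 2, i − 1, i + 1, i + 2 hold enough information to rebuild group i.

```python
P = 257  # prime modulus of the toy field GF(257) (assumption)


def vec_mat(v, A):
    """Row vector v times matrix A over GF(P)."""
    return [sum(v[a] * A[a][b] for a in range(len(v))) % P
            for b in range(len(A[0]))]


def add(u, v):
    return [(x + y) % P for x, y in zip(u, v)]


def sub(u, v):
    return [(x - y) % P for x, y in zip(u, v)]


def solve_left(A, b):
    """Solve x A = b over GF(P), assuming A has full row rank."""
    r, c = len(A), len(A[0])
    # One equation per column j of A: sum_a x[a] * A[a][j] = b[j].
    rows = [[A[a][j] for a in range(r)] + [b[j]] for j in range(c)]
    for col in range(r):
        pivot = next(i for i in range(col, c) if rows[i][col])
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, P)  # modular inverse (Python 3.8+)
        rows[col] = [x * inv % P for x in rows[col]]
        for i in range(c):
            if i != col and rows[i][col]:
                f = rows[i][col]
                rows[i] = [(x - f * y) % P for x, y in zip(rows[i], rows[col])]
    return [rows[a][r] for a in range(r)]


def parity_matrix(j, r, width):
    """Hypothetical stand-in for P_j: Vandermonde columns on distinct points."""
    return [[pow(j * width + b + 1, a, P) for b in range(width)]
            for a in range(r)]


def encode(msgs, Ps):
    """C_i = (m_i, m_i P_i, m_{i-1} P_{i+1} + m_{i+1} P_{i-1}), indices mod m."""
    m = len(msgs)
    return [{"info": msgs[i],
             "local": vec_mat(msgs[i], Ps[i]),
             "dist": add(vec_mat(msgs[(i - 1) % m], Ps[(i + 1) % m]),
                         vec_mat(msgs[(i + 1) % m], Ps[(i - 1) % m]))}
            for i in range(m)]


def repair_group(i, groups, Ps):
    """Rebuild local group i from groups i-2, i-1, i+1 and i+2 only."""
    m = len(groups)

    def grp(off):
        return groups[(i + off) % m]

    def mat(off):
        return Ps[(i + off) % m]

    # C_{i-1} stores m_{i-2} P_i + m_i P_{i-2}; remove the known first term.
    left = sub(grp(-1)["dist"], vec_mat(grp(-2)["info"], mat(0)))
    # C_{i+1} stores m_i P_{i+2} + m_{i+2} P_i; remove the known second term.
    right = sub(grp(+1)["dist"], vec_mat(grp(+2)["info"], mat(0)))
    # Recover m_i from m_i [P_{i-2} | P_{i+2}].
    stacked = [mat(-2)[a] + mat(+2)[a] for a in range(len(mat(0)))]
    info = solve_left(stacked, left + right)
    return {"info": info,
            "local": vec_mat(info, mat(0)),
            "dist": add(vec_mat(grp(-1)["info"], mat(+1)),
                        vec_mat(grp(+1)["info"], mat(-1)))}


m, r, width = 6, 3, 2  # toy sizes (assumption)
Ps = [parity_matrix(j, r, width) for j in range(m)]
msgs = [[(7 * i + a + 3) % P for a in range(r)] for i in range(m)]
groups = encode(msgs, Ps)
damaged = list(groups)
damaged[2] = None  # local group 2 fails as a whole
assert repair_group(2, damaged, Ps) == groups[2]
```

Because the repair touches exactly four neighbouring groups, the toy reproduces a group repair locality of 4 regardless of m, provided m ≥ 5 so that the four neighbours are distinct from the failed group.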
In Fig. 8, any one failed local group of LCCR can be recovered by four local groups, and in this case the repair locality of LCCR to repair one failed local group is 4. As before, we choose n = 120 and $d_{\min}$ = 16. Since in LCCR one single failed local group needs 4 adjacent local groups for its repair, we set m ≥ 5. For MSR-local codes and MBR-local codes, it is essential to collect all the surviving local codes and the global parities to repair a single failed local group. Since MSR-local codes have m local codes and one global parities part, the repair locality of MSR-local codes is m. For the same reason, the repair locality of MBR-local codes is also m. Consequently, MSR-local codes and MBR-local codes have the same repair locality of m, as shown in Fig. 9. From Fig. 9, the repair locality of LCCR is smaller than that of MSR-local codes and MBR-local codes, regardless of the storage overhead.
Similarly, the bandwidth overhead can be calculated by $\frac{m\omega + \omega'}{m\alpha}$, where ω denotes the average repair bandwidth for repairing a single failed local group, and $\omega'$ the repair bandwidth required for repairing the global parities (in particular, $\omega' = 0$ for LCCR).

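Lemma 6: In MSR-local codes, $\omega_{\text{MSR-local}} = mr\alpha + \alpha$ and $\omega' = mr\alpha$, which determine the bandwidth overhead of MSR-local codes for repairing one local group.
Lemma 7: The average bandwidth overhead of MBR-local codes to repair one local group is determined by $\omega_{\text{MBR-local}} = m\left(r\alpha - \binom{r}{2}\right) + \alpha$ and $\omega' = m\left(r\alpha - \binom{r}{2}\right)$.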
A summary of repair locality and bandwidth overhead for one local group failure is presented in Table 3. Since MSR-local codes and MBR-local codes take the same values for the parameters m, r, δ, Δ when n and $d_{\min}$ are fixed, it follows that $B_{\text{MBR-local}} < B_{\text{MSR-local}}$ when repairing one failed local group. MBR-local codes have special local codes, namely RBT MBR codes. In RBT MBR codes, any two nodes in a fully connected graph have a common code symbol, and the repair process can be accomplished by mere transfer of data without any arithmetic operation. In the same way, we adopt numerical analysis to compare the bandwidth overhead of LCCR with that of MSR-local codes or MBR-local codes in the case of a single local group failure.
Again choosing n = 120, $d_{\min}$ = 16 and m ≥ 5, we compare repair bandwidth versus storage overhead of LCCR, MSR-local codes, and MBR-local codes for the case of one local group failure, as shown in Fig. 10. LCCR, MSR-local codes and MBR-local codes are almost on the same tradeoff curve of bandwidth overhead and storage overhead, except for the three points with higher storage overhead. Concretely, the bandwidth overhead of MBR-local codes is smaller than that of MSR-local codes, which agrees with the theoretical analysis above. In particular, when the storage overhead is 2, LCCR achieve a bandwidth overhead of 75, almost the same as the bandwidth overhead of 76 of MSR-local codes.
In Fig. 10, as the storage overhead of LCCR increases up to 6, the bandwidth overhead of LCCR becomes worse than that of MSR-local codes and MBR-local codes. Since the storage overhead is defined as the ratio of the total number of encoded symbols to the number of message symbols contained in the original file, its reciprocal can be regarded as the code rate R. When the code rate of LCCR is small, LCCR has no advantage in bandwidth overhead. Thus, LCCR with a higher code rate should be chosen in distributed storage systems to achieve better performance in bandwidth consumption.

D. REPAIR OF MULTIPLE FAILED LOCAL GROUPS
In this subsection, we discuss the repair of multiple failed local groups, although the probability that multiple groups fail at the same time is very small. From the construction of LCCR in Fig. 2, several failed local groups of LCCR can also be recovered simultaneously. First, two failed local groups can be repaired only when they are separated by no fewer than two other local groups. Two failed local groups cannot be repaired if they are adjacent or separated by only one local group. In the remaining cases of two failed local groups, at most 8 and at least 6 local groups are required to repair them, so the repair locality per failed group is at most 4 and at least 3. However, it is important to point out that in Fig. 2, local codes $C_1$ and $C_m$ should be regarded as two adjacent local groups. If $C_1$ and $C_m$ are the two failed local groups, these two failed local groups cannot be repaired at the same time.
Three failed local groups cannot be repaired when they are adjacent, or when any two of them are either adjacent or separated by only one other local group. Since at most 12 and at least 8 local groups are needed to repair the failed local groups, the locality per failed local group is at most 4 and at least 8/3 on average. Since the probability that four local groups fail simultaneously is very small, we omit this case in this paper.
Theorem 4 (Upper Bound): In LCCR, if δ + 2 − 1 ≤ m/3, the number of failed local groups that can be repaired cooperatively by other available local groups is upper bounded by δ + 2 − 1. Otherwise, the number of repairable local groups is upper bounded by m/3.
Proof: Since LCCR can be regarded as one linear code with code length n = m(r +δ + −1) and minimum distance d_min = δ + 2 , at most d_min − 1 = δ + 2 − 1 erasures in LCCR can be recovered simultaneously when δ + 2 − 1 ≤ m/3. If we distribute the δ + 2 − 1 erasures over δ + 2 − 1 local groups, each local group will have one erasure. Then the number of repairable local groups is upper bounded by δ + 2 − 1.
If m/3 < δ + 2 − 1, then, since two failed local groups can be repaired simultaneously only when they are separated by no fewer than two other local groups, the maximum number of repairable local groups is m/3.
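The two cases of the bound can be sketched in Python as the minimum of d_min − 1 and m/3. The sketch below takes the minimum distance `d_min` as an input rather than computing it from the code parameters, and it uses the integer part of m/3 because the number of groups is an integer; both choices, and all function names, are assumptions made for illustration. A brute-force search is included to check that, for m ≥ 3 groups on a ring, the spacing rule alone allows at most that many simultaneously failed groups.

```python
from itertools import combinations


def max_failed_groups_by_spacing(m):
    """Brute-force the largest set of groups on a ring of m groups
    such that any two chosen groups have >= 2 groups between them."""
    def well_spaced(groups):
        # Ring distance >= 3 means at least two groups lie in between.
        return all(min((a - b) % m, (b - a) % m) >= 3
                   for a, b in combinations(groups, 2))

    best = 0
    for size in range(1, m + 1):
        if any(well_spaced(c) for c in combinations(range(m), size)):
            best = size
        else:
            break  # no valid set of this size, so none of a larger size
    return best


def repairable_groups_upper_bound(d_min, m):
    """Upper bound on cooperatively repairable groups: the smaller of
    d_min - 1 and the integer part of m / 3 (ASSUMPTION: floor)."""
    return min(d_min - 1, m // 3)
```

For example, with m = 12 groups the spacing rule allows at most 4 failed groups, so a code with `d_min = 4` is limited to 3 by its minimum distance, whereas a code with `d_min = 10` is limited to 4 by the spacing rule.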
In contrast, according to the structure of the MSR-local codes and MBR-local codes proposed in [15], these codes can repair one failed local group but are incapable of repairing two or more failed local groups at the same time. Moreover, for MSR-local codes and MBR-local codes, it is necessary to collect all the remaining local codes and the global parities to repair one failed local group, which increases the repair complexity and locality. Consequently, MSR-local codes and MBR-local codes are limited in their ability to recover failed local groups.

V. CONCLUSION
Involving a complex hyperspace of cyber, physical, and social spaces, CPSS have seen significant adoption in the past few years; they are secure by design and exemplify a distributed computing system with high fault tolerance. This paper studies the fault tolerance scheme for distributed storage systems of CPSS, and proposes a framework that can recover multiple failed nodes simultaneously. Concretely, we investigate the case in which some local groups contain more than δ − 1 failed nodes, so that the corresponding local codes need to be recovered as a whole. In particular, an explicit construction of LCCR based on MSR codes is proposed to repair the failed local groups with lower repair locality. In this construction, the distributed parity symbols of each local code can be generated by the MSR codes in its two adjacent local codes, which can be regarded as a kind of mutual interleaving structure among the local codes. Based on LCCR, the failed local groups can be repaired cooperatively by their adjacent local groups, and the key parameters of LCCR are derived.
The performance of LCCR, MSR-local codes, and MBR-local codes in terms of storage overhead, repair bandwidth overhead, and repair locality is discussed. Theoretical and numerical analyses show that, compared with MSR-local codes and MBR-local codes, LCCR has benefits in storage overhead and repair locality, and achieves almost the same tradeoff curve of bandwidth overhead and storage overhead as MSR-local codes and MBR-local codes for a single failed local group. Moreover, LCCR can repair multiple failed local groups simultaneously, except when two failed local groups are adjacent to each other or separated by only one local group.