Efficient Storage Scaling for MBR and MSR Codes

Due to the explosive growth of storage demands, distributed storage systems need to support storage scaling efficiently. Recent work optimizes scaling in a decentralized manner for Reed-Solomon coded storage systems. In this paper, we focus on storage scaling for storage systems with regenerating codes and design two efficient scaling algorithms for minimum bandwidth regenerating (MBR) and minimum storage regenerating (MSR) codes. We integrate these two scaling algorithms into the Hadoop Distributed File System (HDFS), and experiments on Amazon EC2 show that the scaling bandwidth can be reduced by up to 75% and 43.8%, respectively, over centralized scaling.


I. INTRODUCTION
Many distributed storage systems adopt erasure coding (e.g., Reed-Solomon codes [22]) against node (or server) failures with low redundancy. An erasure code is typically specified by two parameters n and k as an (n, k) code. An (n, k) coded storage system organizes data as fixed-size blocks, and encodes every k blocks (called data blocks) into n − k coded blocks (called parity blocks), such that the original k data blocks can be reconstructed from any k of the n data and parity blocks. Erasure coding achieves significantly higher reliability than replication at the same storage overhead [27], and has been widely adopted in distributed storage systems [8], [15], [23] and practical cloud storage systems, e.g., Azure [15] and Facebook [17].
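To make the (n, k) coding model concrete, the following Python sketch (our illustration, not any system's actual code) shows the simplest case, a single-parity (4, 3) code over XOR; real systems use Reed-Solomon codes over finite fields when n − k > 1.

```python
# Minimal sketch of an (n, k) = (4, 3) single-parity code over XOR.
# Any n - 1 surviving blocks suffice to rebuild the missing one.

def encode(data_blocks):
    """Append one parity block: the XOR of the k = 3 data blocks."""
    parity = bytes(a ^ b ^ c for a, b, c in zip(*data_blocks))
    return data_blocks + [parity]

def recover_lost(blocks, lost_index):
    """Rebuild a lost block by XOR-summing the surviving n - 1 blocks."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    return bytes(x ^ y ^ z for x, y, z in zip(*survivors))

data = [b"abc", b"def", b"ghi"]           # k = 3 data blocks
stripe = encode(data)                      # n = 4 blocks in total
assert recover_lost(stripe, 1) == b"def"   # any single loss is repairable
```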
These storage systems often need to add new storage nodes to increase both storage space and service bandwidth to accommodate increasing storage demands. Cloud storage systems in particular can scale their storage capacity to balance performance and monetary costs. For example, Amazon Web Services provides an in-memory storage service called ElastiCache [2] that can add in-memory nodes to meet increasing user demands based on an automatic scaling mechanism [3]. Therefore, many researchers have focused on storage scaling [13], [16], [19], [28], [29], [31]-[34], in which a storage system relocates existing stored data and re-computes erasure-coded data across all existing and newly added nodes, while maintaining a balanced data layout.
Storage scaling for erasure-coded storage inevitably incurs substantial scaling bandwidth (i.e., the amount of data transferred during scaling) [16], [29], [33], as the scaling process needs to migrate data blocks to different nodes (data migration) and additionally transfer blocks to update parity blocks based on the new data block layout (parity update). We note that the scaling process is inherently different from the repair process, which mainly aims to minimize the repair bandwidth (i.e., the amount of data transferred to repair a lost block). Although both scaling and repair aim to minimize bandwidth, they build on different settings that lead to different bandwidth analyses. For example, the repair process assumes that the total number of nodes stays the same after repair, whereas in the scaling process the total number of nodes changes. Figure 1 illustrates a repair process and a scaling process under a (4, 2) code, and shows that the former maintains four nodes while the latter grows from four nodes to five. Note that the repair process keeps the parameters (n, k) and the data size of each node unchanged, while the scaling process changes both, as illustrated in Figure 1. Thus, the scaling process differs significantly from the repair process.
Many studies [16], [19], [28], [29], [31], [32], [34] focus on how to perform storage scaling efficiently for traditional erasure codes (i.e., Reed-Solomon codes) to reduce the scaling bandwidth (see Section II for more detail). Some recent studies [13], [33] find the optimal lower bounds of the scaling bandwidth for Reed-Solomon codes and design optimal scaling algorithms. Some storage systems [12], [18] have recently adopted regenerating codes (RGCs) [7], which give an optimal tradeoff curve between storage overhead and repair efficiency and identify two extreme points: one extreme point is minimum bandwidth regenerating (MBR) codes, while the other is minimum storage regenerating (MSR) codes. However, for regenerating codes, which are a family of state-of-the-art erasure codes with optimal repair bandwidth, it remains unexplored how to further minimize the scaling bandwidth. This motivates us to find the optimal lower bounds of scaling bandwidth for RGCs, but we find it challenging to extend the optimal storage scaling results for Reed-Solomon codes [13], [33] to RGCs. This is because RGCs split each data block into many fragments, which makes data migration during scaling complicated. More importantly, unlike Reed-Solomon codes, whose parity blocks are linear combinations of data blocks, RGCs' parity blocks are complicated combinations of fragments [18], [26], which makes parity update during scaling quite difficult.

FIGURE 1. Differences between repair and scaling. For a file of size M under a (4, 2) code, each node stores data of size M/2. After repair (Figure 1(a)), each node maintains size M/2, while after scaling (Figure 1(b)), each node's size changes to M/3.
In this paper, we study storage scaling of regenerating-coded storage systems and minimize the scaling bandwidth for MBR and MSR codes by utilizing the structural properties of these codes and network coding, as in [33]. We focus on two kinds of deterministic regenerating codes: E-MBR codes [20], which have a generalized construction scheme of MBR codes, and Butterfly codes [18], which are practical MSR codes deployed in real storage systems (e.g., HDFS and Ceph). We also design two corresponding efficient scaling algorithms that achieve optimal (or near-optimal) scaling bandwidth. This paper mainly focuses on regenerating codes with n − k = 1 and n − k = 2, which are the most common fault-tolerance settings. More parameter settings will be discussed in future work.
Our contributions include:
• We present novel scaling methods for E-MBR and Butterfly codes, and utilize local parity updates and the features of the two codes to reduce the scaling bandwidth (Sections IV and V).
• We implement our two scaling methods atop Facebook's HDFS [4] and conduct experiments on Amazon EC2. We show that the scaling bandwidth can be reduced by up to 75% and 43.8% compared with the centralized scaling methods (Sections VI and VII).
The rest of this paper is organized as follows. Section II surveys related work on storage scaling. Section III introduces the differences between repair and scaling, and motivates this paper. Sections IV and V design two efficient scaling methods for E-MBR codes and MSR codes, respectively. Section VI presents the implementation details of the two scaling methods on HDFS. Section VII shows the evaluation results. Finally, we conclude our work in Section VIII.

II. RELATED WORK
A. REPAIR FRIENDLY CODES
Previous work has mostly focused on repair-friendly codes, which can mitigate the repair bandwidth bottleneck in erasure-coding-based cloud storage. Locally Repairable Codes (LRCs) [14] are a kind of repair-friendly code that can effectively reduce repair bandwidth, and LRCs have been demonstrated in cloud storage [15], [23], [30]. Unlike LRCs, which require extra storage overhead, a newer family of repair-friendly codes, regenerating codes (RGCs) [7], can reduce repair bandwidth without increasing storage overhead.
Since regenerating codes [7] were proposed, a large volume of published studies has described how to minimize repair bandwidth and storage overhead. Rashmi et al. [21] proposed practical constructions for MBR and MSR codes to repair failed nodes efficiently using a product-matrix framework. Hu et al. [10] presented a functional MSR code that significantly reduces the repair bandwidth. Pamies-Juarez et al. [18] presented a practical MSR code called the Butterfly code, which achieves the optimal repair performance of MSR codes but can only tolerate up to 2 node failures. Vajha et al. [26] presented Clay codes, which optimize repair bandwidth and disk I/O and can tolerate multiple node failures. Furthermore, some recent works (e.g., [9], [11], [25]) distinguish between cross-rack and intra-rack repair bandwidth, and aim to minimize the cross-rack bandwidth. Overall, much of the current literature on regenerating codes pays particular attention to minimizing the repair bandwidth. Our work focuses on minimizing the scaling bandwidth during storage scaling, and presents efficient storage scaling methods for both MBR and MSR codes.

B. STORAGE SCALING
In recent years, there has been an increasing amount of literature on storage scaling. Many studies address storage scaling in RAID arrays [28], [31], [34], aiming to minimize data migration while keeping the same RAID level. Some previous studies address storage scaling in Reed-Solomon [22] coded storage systems [16], [19]. The work most closely related to ours is by Zhang et al. [33], which applied network coding to Reed-Solomon coded storage scaling to minimize the scaling bandwidth, but only supports limited scaling cases. Hu et al. [13] generalized the scaling cases in [33]. Since the construction of traditional erasure codes is totally different from that of regenerating codes, the above scaling approaches cannot be directly applied to regenerating-coded storage systems. Our work focuses on optimal storage scaling for regenerating-coded systems and is complementary to previous Reed-Solomon coded storage scaling.

III. BACKGROUND
A. REGENERATING CODES
Given an (n, k) code, the repair locality d is introduced for repairing a lost block by connecting to d nodes. Let X_i be the i-th existing node, where 1 ≤ i ≤ n.

1) MBR CODES
This paper focuses on explicit MBR codes with d = n − 1, called E-MBR codes [21], [24]. For an (n, k) E-MBR code with d = n − 1 in [24], the n storage nodes store n(n − 1) blocks in total, composed of k(2n − k − 1)/2 data blocks, (n − k)(n − k − 1)/2 parity blocks, and their duplicates. The code construction can be modeled via a complete graph, with n vertices representing the n existing storage nodes and n(n − 1)/2 edges representing the distinct blocks. Each vertex stores the blocks assigned to the n − 1 edges incident on it, so each block is stored on exactly two nodes. Figure 2 illustrates the construction and layout of E-MBR codes with (n, k) = (4, 2) and d = 3. Five data blocks (i.e., 1, . . . , 5), one parity block P (generated by XOR-summing the five data blocks), and their duplicates are distributed according to the complete graph.
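The complete-graph layout above can be sketched in a few lines of Python (our illustration only, not the paper's implementation): blocks sit on the edges of K_n, and node i stores the blocks of the n − 1 edges incident on it, so every block automatically has one duplicate.

```python
# Sketch of the E-MBR layout via the complete graph K_n:
# one block per edge, stored on both endpoint nodes.
from itertools import combinations

def embr_layout(n):
    edges = list(combinations(range(1, n + 1), 2))   # n(n-1)/2 distinct blocks
    block_id = {edge: idx for idx, edge in enumerate(edges, start=1)}
    # node -> list of block ids stored on it (the n - 1 incident edges)
    return {v: [block_id[e] for e in edges if v in e] for v in range(1, n + 1)}

layout = embr_layout(4)                              # (n, k) = (4, 2), d = 3
assert all(len(b) == 3 for b in layout.values())     # n - 1 blocks per node
assert sum(len(b) for b in layout.values()) == 4 * 3 # n(n-1) incl. duplicates
```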

2) MSR CODES
This paper also focuses on an explicit MSR code that tolerates two failures (i.e., n − k = 2), called the Butterfly code [18], which has a recursive code construction and requires only XOR operations for encoding/decoding. For an (n, k) Butterfly code, the construction is based on a data block set D_k and two parity block sets P_k and Q_k, where D_k is a set of k × 2^{k−1} data blocks, and P_k and Q_k are two sets of 2^{k−1} parity blocks. D_k can be expressed as a matrix which contains four components:

D_k = [ D^1_{k−1}  D^2_{k−1} ; a_{k−1}^T  b_{k−1}^T ],  (1)

where D^1_{k−1} and D^2_{k−1} are two (k − 1) × 2^{k−2} matrices, and a_{k−1} and b_{k−1} are column vectors of 2^{k−2} data blocks. P_k and Q_k are generated via two functions P(·) and Q(·) such that P_k = P(D_k) and Q_k = Q(D_k). In particular,

P(D_k) = ( a_{k−1} ⊕ P(D^1_{k−1}),  b_{k−1} ⊕ P(D^2_{k−1}) ),  (2)

and Q(D_k) follows a similar recursion [18] that additionally applies the flip matrix P_m,  (3)

where P_m denotes the matrix whose counter-diagonal elements are one and the rest are zero, such that left-multiplying a matrix by P_m flips the matrix vertically. Figure 3 illustrates Butterfly codes with (n, k) = (5, 3). Note that the layout of the (5, 3) Butterfly code can be composed of the (4, 2) Butterfly code and the bold data blocks, which implies the recursiveness of Butterfly codes.
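The recursive shape of D_k can be sketched as follows (our illustration only; data blocks are modeled as small integers, and the flip-based parity Q is omitted): each level stacks two copies of the (k − 1)-level matrix side by side and appends one extra row (a_{k−1} | b_{k−1}).

```python
# Sketch of the recursive Butterfly data layout: D_k is a k x 2^(k-1)
# matrix built from two (k-1)-level matrices plus one extra row.

def butterfly_data(k, counter):
    """Recursively build D_k, numbering blocks 1, 2, 3, ... for illustration."""
    def fresh():
        counter[0] += 1
        return counter[0]
    if k == 1:
        return [[fresh()]]                       # D_1 is a single block
    D1 = butterfly_data(k - 1, counter)          # left (k-1)-level half
    D2 = butterfly_data(k - 1, counter)          # right (k-1)-level half
    a = [fresh() for _ in range(2 ** (k - 2))]   # extra row, left part
    b = [fresh() for _ in range(2 ** (k - 2))]   # extra row, right part
    return [r1 + r2 for r1, r2 in zip(D1, D2)] + [a + b]

D3 = butterfly_data(3, [0])
assert len(D3) == 3 and all(len(r) == 4 for r in D3)   # k x 2^(k-1)
```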

B. REGENERATING CODES BASED STORAGE SCALING
An (n, k, s) scaling process transforms (n, k)-coded blocks stored on n existing nodes into (n + s, k + s)-coded blocks stored on n + s nodes, including the n existing nodes and the s new nodes. Let Y_j be the j-th new node after scaling, where 1 ≤ j ≤ s.

1) SCALING PROBLEM
As stated in Section I, the scaling problem aims to minimize the bandwidth for relocating existing data and re-computing new erasure-coded data, which is quite different from repair. Figure 4 depicts examples of repair and scaling for E-MBR codes: the repair process keeps the parameters (4, 2) and the 3 blocks of each node, while the scaling process changes the parameters from (4, 2) to (5, 3) and increases the blocks of each node from 3 to 4. The examples of repair and scaling for Butterfly codes are similar and omitted.

2) CHALLENGE
Storage scaling for regenerating codes is challenging because of the complex construction of regenerating codes.
Our goal is to reduce the scaling bandwidth during the scaling process, while preserving the same regenerating code, which is critical for erasure-coded storage systems.

3) MAIN IDEA
We find that E-MBR and Butterfly codes have the following features which can help reduce the scaling bandwidth.
E-MBR. Feature 1: All the blocks of each node have duplicates across the other nodes (e.g., X_1's blocks are duplicated across X_2, X_3 and X_4 in Figure 2), so the blocks of the new nodes can be transferred from the old nodes while maintaining the layout of the old nodes. Feature 2: The parity blocks always stay on specific nodes (e.g., the parity blocks stay on X_3 and X_4 after scaling in Figure 2(b)), so the parity blocks can be updated locally during scaling, similarly to [33].
BUTTERFLY. Feature 1: The construction of Butterfly codes is recursive, so we can update parity blocks by utilizing the old parity blocks instead of encoding all the data blocks into new parity blocks. Feature 2: Both P(D_k) and Q(D_k) contain an element a_{k−1} ⊕ P(D^1_{k−1}), so we can reduce the scaling bandwidth caused by parity update by generating this element in P_k locally before transferring it to Q_k.

IV. SCALING FOR E-MBR CODES
This section presents an (n, k, s) scaling approach for E-MBR codes from (n, k) to (n′, k′), where s = n′ − n = k′ − k, called EMBRScale, which utilizes the features of E-MBR codes to achieve optimal (or near-optimal) scaling bandwidth.
Based on the E-MBR code construction in Section III-A.1, we define every set of n(n − 1) blocks of the (n, k) E-MBR code as an old group and every set of n′(n′ − 1) blocks of the (n′, k′) E-MBR code as a new group. We denote the i-th old group by G_i, and the i-th new group by G′_i. Each node of the (n, k) E-MBR code (before scaling) has n − 1 blocks of G_i, while each node of the (n′, k′) E-MBR code (after scaling) has n′ − 1 blocks of G′_i. We also denote the number of data blocks in G_i by N_D, and the number of data blocks in G′_i by N′_D.

A. LOWER BOUND
Clearly, the scaling bandwidth is lower bounded by the number of blocks stored on the s new nodes [33]. Based on Section III-A.1, for a new group of the (n′, k′) E-MBR code after scaling, each of the s new nodes has n′ − 1 blocks. Thus the lower bound of the scaling bandwidth per new group formed for E-MBR codes (denoted by γ^opt_mbr) is:

γ^opt_mbr = s(n′ − 1).  (4)

B. CENTRALIZED E-MBR SCALING
We first give a centralized scaling approach for E-MBR codes from (n, k) to (n′, k′) operated by a controller, as is commonly done in RAID arrays [28], [31], [34]. The centralized approach also assumes that all the data blocks maintain their order [32]. To form a new group, the controller first collects N′_D data blocks from the n old nodes to generate the parity blocks, and then sends n′ − 1 blocks to each of the s new nodes. Note that we do not consider the data transfer needed to re-organize the data blocks of the old nodes, so the centralized E-MBR scaling needs to transfer at least:

γ^central_mbr = N′_D + s(n′ − 1),  (5)

where γ^central_mbr indicates the necessary scaling bandwidth for centralized E-MBR scaling. Figure 5 illustrates centralized E-MBR scaling with (n, k, s) = (4, 2, 1): the controller needs to download nine blocks and send at least four blocks to the new node Y_1.

C. SCALING APPROACH
The scaling process of EMBRScale borrows the idea of [33], which leverages the computational resources of storage nodes to update parity blocks locally in a decentralized way.
To perform scaling, EMBRScale operates on a collection of N′_D old groups (G_1, . . . , G_{N′_D}) before scaling, which have N_D · N′_D data blocks in total. This collection of blocks can be re-organized into N_D new groups (G′_1, . . . , G′_{N_D}) after scaling. Similar to [28], among the N′_D old groups, the first N_D old groups are denoted by RR (retained region), in which no blocks are migrated during scaling; the remaining N′_D − N_D old groups, which have (N′_D − N_D) · N_D data blocks in total, are denoted by TR (transferred region), in which all the data blocks are transferred to the new groups. So the data transfer for data migration is ns + 2(N′_D − N_D − ns) = s(n′ − 1) data blocks, which ensures that each of the s new nodes has n′ − 1 data blocks after scaling. Figure 6 shows that EMBRScale with (n, k, s) = (3, 2, 1) transfers three data blocks from the three old nodes X_1, X_2 and X_3 to the new node Y_1 for data migration. Figure 7 shows that EMBRScale with (n, k, s) = (4, 2, 3) transfers 15 data blocks from the four old nodes X_1, X_2, X_3 and X_4 to the new nodes Y_1, Y_2 and Y_3 for data migration.

Algorithm 2 Parity Update of EMBRScale
1: if n − k = 1 then
2: EMBRScale does nothing.
3: else if n − k = 2 and N_D | N′_D then
4: EMBRScale uses N′_D/N_D old parity blocks from N′_D/N_D old groups to compute the new parity block and its duplicate locally.
5: else if n − k = 2 and N_D ∤ N′_D then
6: EMBRScale sends (N′_D − N_D) data blocks in TR to each of the nodes which has a parity block to compute the new parity blocks.
7: end if

EMBRScale updates parity blocks in three different cases, as shown in Algorithm 2. We denote the scaling bandwidth of EMBRScale per new group in the three cases by γ^case1_mbr, γ^case2_mbr and γ^case3_mbr, respectively.
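As a sanity check on the migration count (a sketch of our own, using only the data-block count N_D = k(2n − k − 1)/2 from Section III-A.1), the identity ns + 2(N′_D − N_D − ns) = s(n′ − 1) can be verified numerically:

```python
# Sanity check: the E-MBR data-migration count
# n*s + 2*(N'_D - N_D - n*s) equals s*(n' - 1),
# where N_D = k(2n - k - 1)/2 counts data blocks per group.

def n_data(n, k):
    """Number of data blocks per E-MBR group (Section III-A.1)."""
    return k * (2 * n - k - 1) // 2

for (n, k, s) in [(3, 2, 1), (4, 2, 1), (4, 2, 3), (5, 4, 1), (6, 4, 2)]:
    np_, kp = n + s, k + s                    # n', k' after scaling
    Nd, Ndp = n_data(n, k), n_data(np_, kp)   # N_D and N'_D
    migrated = n * s + 2 * (Ndp - Nd - n * s)
    assert migrated == s * (np_ - 1)
```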
The illustration for Case 3 is omitted due to limited space.
Theorem 1: EMBRScale achieves the optimal scaling bandwidth in Cases 1 and 2.
Proof: In Cases 1 and 2, the scaling bandwidth of EMBRScale (see Equations (6) and (7)) equals the lower bound of the scaling bandwidth for E-MBR codes (see Equation (4)). Thus, the theorem holds. Figure 8 illustrates the flowchart of EMBRScale for the i-th (1 ≤ i ≤ N_D) new group G′_i. In general, Steps 1 and 2 leverage Features 1 and 2 of E-MBR codes (see Section III-B.3), respectively. In this way, for data migration, we can simply transfer data blocks to the new nodes without re-organizing the data blocks among the old nodes; for parity update, we can update parity blocks locally. Therefore, EMBRScale achieves the optimal scaling bandwidth in Cases 1 and 2 (see Theorem 1), but not in Case 3, due to the cross-node bandwidth for parity updates (see Equations (4) and (8)). We pose the optimization of Case 3 as future work.

D. NUMERICAL ANALYSIS
We compare the scaling bandwidth of EMBRScale with the centralized E-MBR scaling. Based on Equations (4)-(8), the scaling bandwidths are calculated as the total number of blocks transferred during scaling per new group. Figure 9 shows that the scaling bandwidth of EMBRScale can be reduced by up to 75% over the centralized E-MBR scaling when (n, k, s) = (5, 4, 1).
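The comparison can be reproduced with a short sketch (ours, using the per-new-group costs described in Sections IV-A and IV-B: s(n′ − 1) for the lower bound that EMBRScale attains in Cases 1 and 2, and N′_D + s(n′ − 1) for centralized scaling):

```python
# Numerical sketch of the E-MBR scaling-bandwidth comparison.

def n_data(n, k):
    """Data blocks per E-MBR group: k(2n - k - 1)/2 (Section III-A.1)."""
    return k * (2 * n - k - 1) // 2

def reduction(n, k, s):
    np_, kp = n + s, k + s
    gamma_opt = s * (np_ - 1)                      # EMBRScale, Cases 1 and 2
    gamma_central = n_data(np_, kp) + s * (np_ - 1)
    return 1 - gamma_opt / gamma_central

# (n, k, s) = (5, 4, 1) is an n - k = 1 setting (Case 1), where EMBRScale
# matches the lower bound; the saving over centralized scaling is 75%.
assert abs(reduction(5, 4, 1) - 0.75) < 1e-9
```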

V. SCALING FOR BUTTERFLY CODES
This section presents an (n, k, s) scaling approach for Butterfly codes from (n, k) to (n′, k′), where s = n′ − n = k′ − k, called ButterflyScale, which utilizes the features of Butterfly codes to reduce the scaling bandwidth.
Based on the Butterfly code construction in Section III-A.2, we define every set of the blocks of D_k and its parity blocks before scaling as an old group, and every set of the blocks of D_{k+s} and its parity blocks after scaling as a new group. We denote the i-th old group by G_i, and the i-th new group by G′_i, similar to EMBRScale.

A. LOWER BOUND
Since Butterfly codes belong to the class of Maximum Distance Separable (MDS) codes [7], the scaling bandwidth for Butterfly codes (denoted by γ^opt_msr) per new group is lower bounded by the optimal scaling bound for MDS codes given in [13] (Equation (9)).

B. CENTRALIZED BUTTERFLY SCALING
Similar to E-MBR scaling, we also give a centralized scaling approach for Butterfly codes from (n, k) to (n′, k′) via a controller. To form a new group, the controller first downloads k′ · 2^{k′−1} data blocks to generate 2 · 2^{k′−1} parity blocks, sends 2^{k′−1} data blocks to each of the s new nodes for data migration, and sends 2 · 2^{k′−1} parity blocks to two nodes for parity update. Note that we do not consider the data transfer needed to re-organize the data blocks of the old nodes, so the centralized Butterfly scaling needs to transfer at least:

γ^central_msr = k′ · 2^{k′−1} + s · 2^{k′−1} + 2 · 2^{k′−1} = (k′ + s + 2) · 2^{k′−1},  (10)

where γ^central_msr indicates the necessary scaling bandwidth for the centralized Butterfly scaling. Figure 10 illustrates centralized Butterfly scaling with (n, k, s) = (4, 2, 1). We find that the controller needs to download 12 blocks (i.e., 1, . . . , 12), sends 8 parity blocks to X_3 and X_4, and sends 4 blocks (i.e., 3, 6, 9, 12) to the new node Y_1.
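The centralized count can be checked with a small sketch (ours, following the per-group transfers described above: the controller downloads the new group's data, pushes data to the new nodes, and pushes the recomputed parities to the two parity nodes):

```python
# Sketch of the centralized Butterfly scaling bandwidth per new group.

def gamma_central_msr(k, s):
    kp = k + s
    half = 2 ** (kp - 1)
    download = kp * half        # k' * 2^(k'-1) data blocks pulled in
    migrate = s * half          # 2^(k'-1) data blocks to each new node
    parity = 2 * half           # 2 * 2^(k'-1) updated parity blocks pushed out
    return download + migrate + parity

# k = 2, s = 1 (so k' = 3): 12 downloaded + 4 migrated + 8 parity = 24 blocks
assert gamma_central_msr(2, 1) == 24
```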

C. SCALING APPROACH
Similar to EMBRScale, ButterflyScale also uses the computational resources of storage nodes to update parity blocks locally in a decentralized way. Due to the complexity of the Butterfly code construction, we only focus on the scaling case of s = 1 (i.e., adding one new node), and pose the case of s > 1 as future work.
To perform scaling, ButterflyScale operates on a collection of 2k + 2 old groups (G_1, . . . , G_{2k+2}) before scaling, each of which has 2^{k−1} · k data blocks. This collection of blocks can be re-organized into k new groups (G′_1, . . . , G′_k) after scaling, each of which has 2^k · (k + 1) data blocks. To form the i-th (1 ≤ i ≤ k) new group G′_i, ButterflyScale operates on the data blocks of the groups G_{2k+1} and G_{2k+2} at node X_i, and the two old groups G_{2i−1} and G_{2i}, described by the following two steps, similar to EMBRScale.

Algorithm 3 Data Migration of ButterflyScale
1: X_i (1 ≤ i ≤ k) sends 2^{k−1} data blocks of the group G_{2k+1} to the new node.
2: X_i (1 ≤ i ≤ k) sends 2^{k−1} data blocks of the group G_{2k+2} to the new node.

STEP 1 (DATA MIGRATION):
ButterflyScale sends the data blocks to the new node during the scaling process, as shown in Algorithm 3. It lets node X_i transfer its data blocks in the groups G_{2k+1} and G_{2k+2} to the new node according to the layout of Butterfly codes. So the data transfer for data migration is 2 × 2^{k−1} blocks (see Section III-A.2), which ensures that the new node has 2^k data blocks after scaling. Figure 11 shows that, to form the new group G′_1, ButterflyScale with (n, k, s) = (4, 2, 1) lets node X_1 transfer the data blocks 17, 19, 21, 23 in G_5 and G_6 to the new node Y_1 for data migration.

STEP 2 (PARITY UPDATE):
ButterflyScale performs the parity updates from the two old parity block sets P_k and Q_k to the two new parity block sets P_{k+1} and Q_{k+1}, specified as follows. Based on the code construction of Butterfly codes in Section III-A.2, we have

D_{k+1} = [ D^1_k  D^2_k ; a_k^T  b_k^T ],  (11)

where D^1_k and D^2_k are the data matrices of the two old groups, and a_k and b_k are the data blocks of the new node. Based on Equations (2) and (11), we can infer that the parity block set P_{k+1} is

P_{k+1} = P(D_{k+1}) = ( a_k ⊕ P(D^1_k),  b_k ⊕ P(D^2_k) ).  (12)

By Equation (12), we find that the first new parity block set P_{k+1} can be generated by XOR-summing the data blocks of the bold part (a_k and b_k) with the two old parity block sets P(D^1_k) and P(D^2_k) of P_k. Thus, to update the parity from P_k to P_{k+1}, ButterflyScale only needs to transfer

2 × 2^{k−1}  (corresponding to a_k and b_k)  (13)

data blocks to the node where P_k resides.
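The local P-update can be checked with a small sketch (ours; blocks are modeled as integers under XOR, and P(·) is taken as the column-wise XOR parity): the node holding P_k combines its old halves with only the received a_k and b_k, and obtains exactly the parity of the full new data matrix.

```python
# Sketch of the local P-update: P_{k+1} = (a ^ P(D1), b ^ P(D2))
# equals the column-wise XOR parity of D_{k+1} = [D1 D2; a b].
from functools import reduce

def P(D):
    """Column-wise XOR parity of a data matrix."""
    return [reduce(lambda u, v: u ^ v, col) for col in zip(*D)]

D1 = [[1, 2], [3, 4]]          # old group 1 (k = 2, 2 x 2 data matrix)
D2 = [[5, 6], [7, 8]]          # old group 2
a, b = [9, 10], [11, 12]       # data blocks arriving from the new node

# Local update using only old parities plus a and b ...
P_new = [x ^ y for x, y in zip(a, P(D1))] + [x ^ y for x, y in zip(b, P(D2))]

# ... equals the parity of the full new data matrix, with no other transfer.
D_full = [r1 + r2 for r1, r2 in zip(D1, D2)] + [a + b]
assert P_new == P(D_full)
```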

Algorithm 4 Parity Update of ButterflyScale
1: ButterflyScale sends the 2^k data blocks (a_k, b_k) from the new node to the node storing P_k.
2: ButterflyScale sends 2^{k−1} blocks (P_k b_k) from the new node to the node storing Q_k.
3: ButterflyScale sends 2^{k−1} blocks (P_k (a_k ⊕ P(D^1_k))) from the node storing P_{k+1} to the node storing Q_k.
4: ButterflyScale sends (k − 1) × 2^{k−1} blocks (P_{k−1} P(D^2_{k−1}) and P_{k−1} P(D^1_{k−1})) from the old data nodes to the node storing Q_k.
5: ButterflyScale uses the received blocks to update P_k and Q_k to P_{k+1} and Q_{k+1}.

Based on Equations (3) and (11), we can infer that the second new parity block set Q_{k+1} (Equation (14)) can be generated by XOR-summing the data blocks of the bold part with the two old parity block sets Q(D^1_k) and Q(D^2_k) of Q_k. Thus, to update the parity block set from Q_k to Q_{k+1}, ButterflyScale only needs to transfer

2^{k−1} (corresponding to P_k b_k) + 2^{k−1} (corresponding to P_k (a_k ⊕ P(D^1_k))) + (k − 1) × 2^{k−2} (corresponding to P_{k−1} P(D^2_{k−1})) + (k − 1) × 2^{k−2} (corresponding to P_{k−1} P(D^1_{k−1}))  (15)

blocks to the node where Q_k resides. Note that the coded blocks a_k ⊕ P(D^1_k) can be transferred from P_{k+1} to Q_{k+1}. We describe the parity update step in Algorithm 4. From Steps 1 and 2 (Equations (13) and (15)), the scaling bandwidth of ButterflyScale (denoted by γ^butterflyscale_msr) per new group formed after scaling is:

γ^butterflyscale_msr = 2^k + 2^k + 2^k + (k − 1) × 2^{k−1} = (k + 5) × 2^{k−1}.  (16)

Figure 11 illustrates ButterflyScale with (n, k, s) = (4, 2, 1), where six old groups G_1, . . . , G_6 are changed into two new groups G′_1, G′_2. For example, G′_1 can be formed by first transferring four data blocks of G_5 and G_6 at node X_1 (i.e., 17, 19, 21, 23) to the new node Y_1, then transferring four data blocks (i.e., 17, 19, 21, 23) to X_3 to update P, and last transferring six blocks (i.e., 5, 7, 21, 23, (19 ⊕ 4 ⊕ 3), (17 ⊕ 2 ⊕ 1)) to X_4 to update Q. Figure 12 illustrates the flowchart of ButterflyScale for the i-th (1 ≤ i ≤ k) new group G′_i.
In general, Step 2 leverages Features 1 and 2 of Butterfly codes (see Section III-B.3). In this way, ButterflyScale can use old local parity blocks to generate new parity blocks in a recursive and decentralized manner, thereby reducing the scaling bandwidth caused by parity update.

D. NUMERICAL ANALYSIS
We compare the scaling bandwidth of ButterflyScale with the centralized Butterfly scaling. Based on Equations (9), (10) and (16), the scaling bandwidths are calculated as the total number of blocks transferred during scaling per new group. Figure 13 shows that the scaling bandwidth of ButterflyScale can be reduced by up to 43.8% over the centralized Butterfly scaling when (n, k, s) = (6, 4, 1).
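The comparison can be reproduced with a short sketch (ours, summing the per-new-group transfer counts of Sections V-B and V-C: Step 1 migration, the P-update of Equation (13), and the Q-update of Equation (15), against the centralized count of Equation (10) with s = 1):

```python
# Numerical sketch of the Butterfly scaling-bandwidth comparison (s = 1).

def gamma_butterflyscale(k):
    half = 2 ** (k - 1)
    migrate = 2 * half                   # Step 1: a_k and b_k to the new node
    to_p = 2 * half                      # Equation (13): update of P
    to_q = 2 * half + (k - 1) * half     # Equation (15): update of Q
    return migrate + to_p + to_q         # = (k + 5) * 2^(k-1), Equation (16)

def gamma_central(k, s=1):
    kp = k + s
    return (kp + s + 2) * 2 ** (kp - 1)  # Equation (10)

# (n, k, s) = (6, 4, 1): 72 vs. 128 blocks, a 43.75% (~43.8%) reduction
r = 1 - gamma_butterflyscale(4) / gamma_central(4)
assert abs(r - 0.4375) < 1e-9
```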

VI. IMPLEMENTATION
A. FACEBOOK'S HDFS OVERVIEW
Facebook's HDFS has a single NameNode, which manages metadata, and multiple DataNodes, which store data; it integrates HDFS-RAID [5] as a layer atop the original HDFS to support erasure coding operations. HDFS-RAID introduces a RaidNode to handle the erasure coding operations and manage erasure-coded blocks. Taking the encoding operation for a stripe as an example, the RaidNode first obtains the data blocks of the stripe from several DataNodes by using metadata received from the NameNode. Then the RaidNode encodes these data blocks to generate parity blocks, and distributes these blocks to the DataNodes.

B. INTEGRATION
We now explain how we integrate EMBRScale and ButterflyScale into Facebook's HDFS. We first implement E-MBR codes and Butterfly codes in HDFS-RAID as the preparation before scaling, and then we implement the scaling operations on these two types of codes. We describe the details of the implementation as follows.

ERASURE CODES: We implement the computation part of the scaling operations and the encode/decode operations of the E-MBR and Butterfly codes in C++ based on Intel's ISA-L [6], and connect this computation part to HDFS through the Java Native Interface (JNI). We mainly use the ec_init_tables API of ISA-L to initialize the coding matrix and the ec_encode_data API of ISA-L to perform the coding operations.

PLACEMENT ALGORITHMS:
The original HDFS-RAID only supports random distribution of data and parity blocks, whereas each of the data and parity blocks of the E-MBR and Butterfly codes needs to be placed on a specific node. Thus, we add placement algorithms to the RaidNode to ensure that these two codes are implemented correctly.

SCALING PROCESSES:
Both scaling processes need to transfer multiple blocks among multiple nodes. So we modify the RaidNode to collect the metadata of the blocks that participate in the scaling operations from the NameNode and then distribute the metadata to the DataNodes. We also modify the DataNodes to execute the specific scaling operations with the received metadata, such as transferring blocks and updating parity blocks.

VII. EVALUATION
In this section, we present evaluation results for the scaling performances of EMBRScale and ButterflyScale. We compare EMBRScale with the centralized E-MBR scaling (denoted by E-Central), and we also compare ButterflyScale with the centralized Butterfly scaling (denoted by B-Central).

A. CLOUD EXPERIMENTS SETUP
We conduct the experiments on a cluster of 8 machines. Each machine is an m4.4xlarge instance located in the US East (North Virginia) region of Amazon EC2 [1]. We assign a dedicated instance to serve as both the HDFS NameNode and the HDFS RaidNode. To prevent network variation in the cloud from affecting the experiments, we maintain a stable network in the cluster. We first assign a dedicated instance to mimic a network core, such that any traffic between every pair of instances must traverse this dedicated instance. We then control the outgoing bandwidth of the dedicated instance using the Linux traffic control command tc. In this way, we can maintain a specific bandwidth in the cluster.
We measure the scaling performance of EMBRScale and ButterflyScale as the scaling time per 1 GB of data. To obtain stable results, we execute the scaling processes on around 1000 data blocks and the corresponding parity blocks.

B. EXPERIMENT 1: IMPACT OF BANDWIDTH
We first evaluate the scaling performance of EMBRScale and ButterflyScale under different bandwidth settings. We fix the block size at 64 MB, and vary the cluster bandwidth from 200 Mb/s up to 2 Gb/s. Figure 14 shows the scaling time results of EMBRScale and the centralized E-MBR scaling (E-Central). We measure 3 parameter settings from the 3 cases of EMBRScale, and compare them with the corresponding cases of E-Central. We find that the empirical results in Figure 14 are consistent with the numerical results in Figure 9, and that EMBRScale reduces the scaling time of E-Central. Taking the E-MBR scaling with (n, k, s) = (3, 2, 1) as an example, EMBRScale incurs 66.67% less scaling bandwidth than E-Central in the numerical results, while in the evaluation, EMBRScale incurs 66.49%, 66.29%, 66.1% and 65.65% less scaling time than E-Central when the cluster bandwidth is 200 Mb/s, 500 Mb/s, 1 Gb/s, and 2 Gb/s, respectively. Note that the scaling-time ratio between EMBRScale and E-Central decreases slightly at higher bandwidth, mainly because the RaidNode takes a fixed amount of time to collect metadata from the NameNode, and this fixed time accounts for a higher proportion of the scaling time when the cluster bandwidth is higher. Figure 15 shows the scaling time results of ButterflyScale and the centralized Butterfly scaling (B-Central). Similar to Figure 14, we also find that the empirical results in Figure 15 are consistent with the numerical results in Figure 13, and that ButterflyScale reduces the scaling time of B-Central.

C. EXPERIMENT 2: IMPACT OF BLOCK SIZE
We now evaluate the scaling time versus the block size. We fix the cluster bandwidth at 1 Gb/s, and vary the block size from 1 MB to 64 MB. We depict the evaluation results in Figure 16. We find that the scaling times of E-MBR codes and Butterfly codes are stable across different block sizes, and that EMBRScale and ButterflyScale reduce the scaling time of the centralized scaling in all cases.

VIII. CONCLUSION AND FUTURE WORK
We present scaling approaches for two regenerating codes: E-MBR codes with n − k = 1 and n − k = 2, and Butterfly codes with s = 1. We utilize network coding and the features of these two codes to optimize the scaling processes, and both approaches reduce the scaling bandwidth compared to the centralized scaling approaches. In future work, we will focus on optimizing the scaling processes for E-MBR codes with n − k > 2 and Butterfly codes with s > 1.