An Erasure-Coded Storage System for Edge Computing

The emerging computing paradigm of edge computing is expected to store and process data at the network edge with reduced latency and improved network bandwidth. To the best of our knowledge, key performance issues such as the coding performance of erasure-coded storage systems have not been investigated for edge computing. In this paper, we present an erasure-coded storage system for edge computing. Unlike data center and cloud storage systems, it employs edge devices to perform encoding and decoding operations, which can become a performance bottleneck of the whole storage system due to their limited computing power. Hence, we present a comprehensive study of the performance of erasure coding to see whether it can match the network performance of 5G and Wi-Fi 6 at the network edge. We use the popular edge device Jetson Nano and two state-of-the-art coding libraries: Jerasure and G-CRS. Our evaluation results reveal unsatisfactory performance for Jerasure and high variance for G-CRS. To obtain better and more stable performance, we accelerate erasure coding with OpenMP on a multi-core CPU. Our work demonstrates that this acceleration can deliver stable performance and match the network bandwidth of 5G and Wi-Fi 6 for some commonly used cases. Besides, our work offers a better understanding of erasure-coded storage systems for edge computing and can serve as a reference for further optimization of such systems at the network edge.


I. INTRODUCTION
The proliferation of the Internet of Things has led mobile and IoT devices to generate zillions of bytes of data [1]. According to the prediction of the International Data Corporation (IDC), around 70% of data generated by IoT will be processed at the network edge by 2025 [2], [3]. With 5G [4] and Wi-Fi 6 [5] technologies, latency can be reduced to 1 ms and network bandwidth is expected to reach up to 10 Gbps. As a result, the emerging computing paradigm of edge computing requires storing and processing data at the network edge for real-time processing with low latency and high throughput.
As a replacement for triplication-based storage systems that brings lower storage overhead and higher reliability, erasure-coded storage systems have earned great popularity in data center and cloud storage due to data explosion. Examples are Microsoft's cloud storage system Azure [6] and Facebook's Web service storage system f4 [7]. A recent erasure-coded storage system prototype, ESetStore [8], was designed for data center and cloud platforms to achieve fast data recovery. Meanwhile, well-known open-source file systems such as HDFS [9] and Ceph [10] also support erasure coding to yield high reliability and low storage overhead. However, to the best of our knowledge, how to build an erasure-coded storage system for the computing paradigm of edge computing with coding performance taken into consideration is still an open question. Unlike data center and cloud systems, edge computing needs flexible and easily deployed devices to perform computing tasks in the mobile environment. The hand-sized device Jetson Nano is a good candidate for performing computing tasks at the network edge with both CPU and GPU computing paradigms involved [11]. However, its computing power is no more than 0.5 TFLOPS. This indicates the limited computing power of each computing device deployed for systems at the network edge.
Here we use three parameters, k, m and n, to describe an erasure-coded storage system, where k + m is equal to n. Initially, a file is divided into k equally-sized data blocks, and the storage system uses the encoding operation to generate m equally-sized parity blocks. The n blocks together, which contain the k data blocks and the m parity blocks, are called a stripe. When no more than m blocks are lost from the stripe, the decoding operation can use any remaining k blocks to reproduce the missing blocks and restore the file. Erasure codes that can restore m missing blocks from any available k blocks of the same stripe have the highest error correction capability and are called Maximum Distance Separable (MDS) codes [12]. When edge devices are used to perform encoding and decoding operations, they can become a key performance bottleneck of the whole storage system.
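To make the stripe abstraction concrete, the following minimal sketch (ours, not taken from any library) shows the special case m = 1, where the single parity block is the bitwise XOR of the k data blocks and any one lost block can be rebuilt from the survivors; the constants K and BLOCK and the function names are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define K 3        /* data blocks per stripe (k)        */
#define BLOCK 8    /* bytes per block, tiny for clarity */

/* Encoding for m = 1: the parity block is the XOR of the k data blocks. */
static void encode_parity(uint8_t data[K][BLOCK], uint8_t parity[BLOCK]) {
    memset(parity, 0, BLOCK);
    for (int b = 0; b < K; b++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= data[b][i];
}

/* Decoding: rebuild one lost data block from the k - 1 survivors and parity. */
static void decode_block(uint8_t data[K][BLOCK], const uint8_t parity[BLOCK],
                         int lost, uint8_t out[BLOCK]) {
    memcpy(out, parity, BLOCK);
    for (int b = 0; b < K; b++)
        if (b != lost)
            for (int i = 0; i < BLOCK; i++)
                out[i] ^= data[b][i];
}
```

Tolerating m > 1 failures requires m independent parity equations over a Galois field, which is exactly what the MDS codes introduced below provide.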
In this paper, we present an erasure-coded storage system for edge computing. We study the performance of erasure coding at the network edge by taking two state-of-the-art erasure coding libraries, Jerasure [13] and G-CRS [14], and running them on the edge device Jetson Nano. Jerasure is a library written in the C programming language. It can be compiled and executed on modern CPUs such as Intel, AMD, and ARM with no modification, and it is adopted by HDFS [9] and Ceph [10] to perform erasure coding. G-CRS is, so far, the fastest Cauchy Reed-Solomon coding library on GPUs for data center and cloud systems. As a result, we can use Jerasure and G-CRS to represent the encoding and decoding performance of erasure-coded storage systems at the network edge.
The emerging 5G [4] and Wi-Fi 6 [5] technologies will push the peak bandwidth of the wireless network from 1 Gbps to 10 Gbps [15] at the network edge. This will greatly accelerate the data exchange rate between peer devices for edge computing. As a result, when deploying erasure-coded storage systems at the network edge, the main challenge for erasure coding is to saturate the network bandwidth of 5G and Wi-Fi 6. Our initial evaluation results demonstrate unsatisfactory performance for Jerasure and high variance for G-CRS. This motivates us to present a solution of parallel erasure coding on a multi-core CPU with OpenMP.
To the best of our knowledge, this is the first work to study the construction of erasure-coded storage systems for edge computing with coding performance taken into consideration. We investigate whether the performance of erasure coding can match the network bandwidth of 5G and Wi-Fi 6 at the network edge. Our evaluation results demonstrate that Jerasure may not be able to saturate the bandwidth due to low coding throughput. Meanwhile, G-CRS suffers from unstable performance even though its average throughput can match the network bandwidth of 5G and Wi-Fi 6. This leaves the acceleration of erasure coding on a multi-core CPU with OpenMP [16] as a better choice for some cases. Our work also offers a better understanding of erasure-coded storage systems for edge computing and can serve as a reference for further optimization of such systems at the network edge.
In summary, we make the following contributions:
1) We present the design of an erasure-coded storage system for edge computing, introducing reduced storage overhead and maintaining high reliability at the network edge.
2) We conduct extensive experiments on the edge device Jetson Nano with two state-of-the-art erasure coding libraries, Jerasure and G-CRS, to validate whether the performance of erasure coding can match the network bandwidth of 5G and Wi-Fi 6.
3) To bring better and more stable performance for erasure coding at the network edge, we design and implement a parallel solution on a multi-core CPU with OpenMP.
4) Our work offers a better understanding of erasure-coded storage systems for edge computing and can serve as a reference for further optimization of such systems at the network edge.
The remainder of this paper is organized as follows. Section II introduces the background and some related work. Section III presents an erasure-coded storage system for edge computing. We give the design and implementation of parallel erasure coding on multi-core CPU with OpenMP in Section IV. Section V presents the main evaluation results of erasure coding. We conclude the paper in Section VI.

II. BACKGROUND AND RELATED WORK
In this section, we first briefly introduce the key concepts of edge computing and erasure-coded storage systems. Then, we identify the opportunity and challenge of deploying erasure-coded storage systems at the network edge. Finally, we introduce some related work.

A. EDGE COMPUTING
The proliferation of the Internet of Things (IoT) introduced a new computing paradigm, which calls for processing data at the network edge [1], [2], [17]. At the edge of the network, mobile devices such as mobile phones, cameras, and vehicles are the major sources of data generation. A recent study reveals that more than 800 ZB of data will be generated at the network edge by 2021 [1]. However, it is infeasible to upload such a huge amount of data to global data centers in a timely manner due to the limited network bandwidth between the edge and data centers [1]. Meanwhile, processing data at the network edge greatly reduces latency and improves bandwidth compared with data center and cloud storage. In summary, storing and processing data at the network edge becomes necessary.
Offloading computing tasks in edge computing is quite different from computing in data centers and cloud computing [18]. Because the major data sources are mobile devices, the network connection between peer devices is mainly through the wireless network. This requires computing devices to be placed near the mobile devices that offload computing tasks, in order to reduce network hops. The landscape of edge computing puts many restrictions on computing nodes, which indicates that computing architectures designed for data center and cloud systems cannot be applied directly to the network edge [18]. Table 1 presents a rough comparison between cloud computing and edge computing. While the bandwidth of a wired network can reach up to 100 Gbps, the bandwidth at the network edge can reach up to 10 Gbps with 5G and Wi-Fi 6 technologies. However, the computing power of a device in the edge environment is, in general, only about one hundredth of that of a device in cloud computing. Although some devices may have the same computing power as those used for cloud computing, they come at a higher price and would greatly increase the cost of building the edge infrastructure. As a consequence, the key challenge of offloading computing tasks to edge computing is the limited computing power of easily deployed, small-sized devices, which can become a performance bottleneck for systems.

B. ERASURE-CODED STORAGE SYSTEMS
Figure 1 illustrates how a file is stored in an erasure-coded storage system. A file is divided into k equally-sized data blocks. The encoding operation then produces n blocks, which consist of the k data blocks and a set of m parity blocks. The n blocks are stored on n storage devices to tolerate at most m device failures. When retrieving the file from these storage devices and no more than m blocks have become unavailable, we can gather any available k blocks out of the n blocks and use the decoding operation to restore the missing blocks.
The encoding and decoding operations are computing-intensive tasks. They are performed when writing data to the storage system and when the system needs to restore missing data. If they become the performance bottleneck of the system, the whole system's performance will be affected greatly. Here we introduce two well-known erasure codes, Reed-Solomon codes [19] and Cauchy Reed-Solomon codes, both of which are MDS codes [12], to further illustrate how the encoding operation is performed. The decoding operation is similar to the encoding operation, as it also uses k blocks as input. Thus, we refer to both encoding and decoding as erasure coding, or simply coding, in this paper.

1) REED-SOLOMON CODES
Reed-Solomon (RS) codes are the first generation of erasure codes and have the longest history. Figure 2 demonstrates the encoding operation of RS codes from a programming perspective. A file with L bytes of data is divided into k equally-sized data blocks. We regard the data in each block as a set of w-bit elements, where w is an integer value. On the left side, a matrix with m rows and k columns serves as the Input Matrix. The encoding operation generates one w-bit element of each of the m parity blocks by multiplying one row of the Input Matrix with a column of k w-bit elements taken from the k data blocks. In this way, the m parity blocks are generated by multiplying the Input Matrix with the k data blocks.
When performing decoding operations, the number of rows in the Input Matrix is equal to the number of blocks that need to be restored. The Input Matrix is multiplied with k blocks selected from the available data and parity blocks to regenerate at most m missing blocks. Essentially, from a programming perspective, the encoding and decoding operations are the same operation.
The elements x_{0,0} through x_{m-1,k-1} are values between 0 and 2^w - 1, and Galois Field arithmetic GF(2^w) is used to perform addition, multiplication and division [20]. The value of w must satisfy the condition 2^w + 1 ≥ n, where n is equal to k + m. In practice, w is set to 8, 16, 32 or 64 so that it can be computed efficiently. In GF(2^w), addition is the XOR operation, while multiplication is a complex operation implemented with multiplication tables or discrete logarithm tables [21]-[24]. The multiplication operation is expensive and greatly limits the performance of RS codes.
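As an illustration of this arithmetic, the sketch below multiplies two elements of GF(2^8) by shift-and-XOR, reducing with the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d), one common choice of polynomial. This is only a readable reference; production libraries such as Jerasure use the precomputed tables mentioned above instead:

```c
#include <assert.h>
#include <stdint.h>

/* Multiply a and b in GF(2^8). Addition in GF(2^w) is XOR; multiplication
 * is polynomial multiplication modulo a primitive polynomial, here 0x11d. */
static uint8_t gf256_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;            /* "add" a when the low bit of b is set */
        uint8_t carry = a & 0x80;
        a <<= 1;               /* multiply a by x */
        if (carry)
            a ^= 0x1d;         /* reduce: x^8 == x^4 + x^3 + x^2 + 1 */
        b >>= 1;
    }
    return p;
}
```

The loop above makes plain why a GF multiplication costs far more than the single XOR used for addition.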

2) CAUCHY REED-SOLOMON CODES
To eliminate the expensive multiplication operations in Galois Field arithmetic GF(2^w), Cauchy Reed-Solomon (CRS) codes use a w × w bitmatrix to replace each element in Figure 2. As illustrated in Figure 3, the bitmatrix has mw rows and kw columns, and each element in it is either 1 or 0. In this way, only XOR operations are needed in CRS coding, which can be performed efficiently on both CPUs and GPUs.
When dividing a file into k data blocks for CRS coding, as demonstrated in Figure 3, each block contains many sets of w packets. A packet is a fixed-length unit of data, and we use the packet size to describe its length. The generation of one packet requires at most kw XOR operations. To perform XOR operations efficiently, the packet size is set to a multiple of eight bytes, which is the size of the type long on modern computers. In general, the value of w determines the number of XOR operations. Besides, when implementing CRS coding on a CPU, the packet size decides whether the erasure coding can utilize the CPU cache efficiently. Thus, the parameter w and the packet size are the key factors affecting the performance of CRS coding.
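A sketch of the inner loop this describes: one destination packet of a parity block is the XOR of the source packets whose entries in the corresponding bitmatrix row are 1, processed one 8-byte long word at a time. The layout and names here are our simplification; Jerasure additionally precomputes coding "schedules" from the bitmatrix:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PACKET_LONGS 128   /* packet size = 128 * 8 bytes = 1 KB */

/* Compute one parity packet: XOR together the source packets selected by
 * one row of the bitmatrix. `row` holds kw entries of 0 or 1, and
 * `src[j]` is the j-th of the k*w source packets of one stripe column. */
static void crs_encode_packet(const uint8_t *row, int kw,
                              uint64_t src[][PACKET_LONGS],
                              uint64_t dst[PACKET_LONGS]) {
    for (size_t i = 0; i < PACKET_LONGS; i++)
        dst[i] = 0;
    for (int j = 0; j < kw; j++)
        if (row[j])                               /* bitmatrix entry is 1 */
            for (size_t i = 0; i < PACKET_LONGS; i++)
                dst[i] ^= src[j][i];              /* eight bytes per XOR */
}
```

Because each selected source packet is streamed sequentially in 8-byte words, the chosen packet size directly determines how well these XOR loops exploit the CPU cache.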
In general, erasure coding of CRS codes presents better performance than RS codes. As a consequence, we use CRS codes to study the performance of erasure coding with two erasure coding libraries: Jerasure and G-CRS in this work.

C. OPPORTUNITY AND CHALLENGE
Erasure codes have gained popularity in data center and cloud storage systems due to data explosion. Many studies have optimized erasure coding for the CPU and GPU architectures of data center and cloud storage, as the performance of erasure coding can be a key bottleneck for erasure-coded storage systems. The shift of data producers from data centers and cloud environments to the network edge brings a new storage environment and architecture. As erasure coding is compute-intensive work, it can also be a bottleneck for erasure-coded storage systems for edge computing, which calls for a comprehensive study of its performance in this setting.
Initially, erasure coding was mainly studied and optimized in data center and cloud storage systems. The performance of erasure coding on edge devices like Jetson Nano has not been investigated. While erasure coding can match 100 Gb/s network bandwidth in data center and cloud storage systems, whether it can match the 5G and Wi-Fi 6 network bandwidth on edge devices is still unknown. Jerasure and G-CRS are two erasure coding libraries written in C. They can be compiled and executed on edge devices with no extra effort. Thus, they are proper candidates for studying the performance of erasure coding at the edge. Note that in this paper, we define the throughput of coding, which refers to both encoding and decoding, as the input data size divided by the time required to perform a coding task.
In summary, when providing low storage overhead and high reliability for edge computing with erasure-coded storage systems, the performance of erasure coding is crucial to meeting the network bandwidth of 5G and Wi-Fi 6 at the network edge. This is one main challenge of building erasure-coded storage systems for edge computing and presents an opportunity to optimize the performance of erasure coding.

D. RELATED WORK
In this subsection, we introduce related work on edge computing and on the optimization of erasure coding for erasure-coded storage systems. Edge computing and erasure coding are two different topics in research and industry; data storage brings them together when reducing storage overhead becomes necessary at the network edge.

1) EDGE COMPUTING
As more and more data are produced at the edge of the network, several studies have proposed that it is more efficient to process the data at the network edge [1], [2], [17], with improved bandwidth and reduced latency. A study of computation offloading to edge computing summarizes the main benefits as reduced latency and increased network bandwidth [18]. Many applications will be deployed at the edge of the network. VideoEdge is proposed and implemented to process live video feeds for video analytics at the edge [25]. sFOG provides a general IoT framework for building smart things in the mobile environment [26]. Edge Intelligence presents a comprehensive study of the challenges and opportunities of performing machine learning and building AI systems and applications in edge computing [1].
As edge computing is built for the mobile environment, edge devices for performing computing tasks are quite different from data center and cloud computing platforms [18]. This has led to many studies investigating resource scheduling, resource allocation and task allocation for edge computing [27]-[29]. pCAMP studies the performance of machine learning in edge computing using three candidate edge devices: MacBook, FogNode and Jetson TX2 [30]. Edge AIBench is an end-to-end AI benchmark suite for edge computing [31].

2) OPTIMIZATION ON ERASURE CODING
As erasure coding is a key performance issue of erasure-coded storage systems, many studies have sought to improve its performance from various aspects. We classify them into three categories. Some improve performance by reducing the number of calculations in erasure coding [32], [33]. Some make efficient use of CPU hardware to accelerate erasure coding [21], [34]. The rest use parallel computing techniques to speed up erasure coding on modern hardware: multi-core CPUs and many-core GPUs.
For reducing the XOR operations in erasure coding, a pioneering work optimizes the Cauchy distribution matrix to bring better coding performance for CRS coding [33]. Some heuristic methods are proposed to reduce the XOR operations for decoding while reading less data for recovery [35], [36]. Locally Repairable Codes can halve the XOR operations of decoding for single-block recovery [37], and they are adopted by Microsoft's cloud storage system Azure [6]. A recent work presents a comprehensive study of existing acceleration techniques for erasure coding and uses simulated annealing and genetic algorithms to generate bitmatrices with reduced XOR operations for coding [32].
The main ways of accelerating erasure coding on CPUs are reducing cache misses and vectorization. One study [34] analyzes how data are loaded into the CPU and the cache miss rates of different loading strategies; it increases spatial data locality to improve the cache hit rate and thereby speed up erasure coding. Another work improves the performance of RS coding by streaming Galois field arithmetic with Intel SIMD instructions, showing how to vectorize the operations of erasure coding with Intel's SSE instructions [21]. The streaming technique takes advantage of wide instructions to achieve better performance. In the same way, the 256-bit AVX2 and 512-bit AVX-512 vectorization instructions on Intel and AMD CPUs can be used for further acceleration [32].
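The table-driven approach these works build on can be sketched as follows: for a fixed coefficient c, precompute the 256-entry row tab[b] = c · b in GF(2^8) once, then stream through the block with one lookup per byte. SIMD variants replace this lookup with two 16-entry nibble tables kept in vector registers (e.g. via SSSE3 pshufb), looking up 16 bytes per instruction. The helper names and the polynomial 0x11d here are our illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Reference multiply in GF(2^8) with polynomial 0x11d (see Section II-B). */
static uint8_t gf256_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry) a ^= 0x1d;
        b >>= 1;
    }
    return p;
}

/* dst[i] ^= c * src[i]: the core region operation of RS coding.
 * The per-byte GF multiply becomes a single table lookup. */
static void gf256_mul_region(uint8_t c, const uint8_t *src,
                             uint8_t *dst, size_t len) {
    uint8_t tab[256];
    for (int b = 0; b < 256; b++)     /* one-time table build per coefficient */
        tab[b] = gf256_mul(c, (uint8_t)b);
    for (size_t i = 0; i < len; i++)
        dst[i] ^= tab[src[i]];
}
```

The inner loop touches src and dst sequentially, which is also why the locality-focused optimizations cited above pay off.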
Parallel techniques can be used for erasure coding on multi-core CPUs and many-core GPUs. For multi-core CPUs, CRS codes have been parallelized by assigning the tasks of generating different parity blocks to different CPU cores [38]. There is also work on parallelizing EVENODD and RDP codes on CPUs [39], [40]. The recent increase in GPU performance has motivated many parallel implementations on GPUs. Gibraltar is a coding library for RS codes on GPUs [24]. PErasure [41] and G-CRS [14] are two coding libraries for the acceleration of CRS codes on GPUs.
While G-CRS outperforms other state-of-the-art coding libraries on data center and cloud computing platforms [14], the performance of erasure coding on edge devices such as Jetson Nano is worth investigation, since edge devices use a different CPU architecture and have fewer GPU cores. This motivates us to study the performance of erasure coding at the network edge and to parallelize erasure coding on the CPU with OpenMP when building an erasure-coded storage system for edge computing.

III. AN ERASURE-CODED STORAGE SYSTEM FOR EDGE COMPUTING
While edge computing requires storing and processing data at the network edge, existing storage solutions mainly store their data in data centers and cloud platforms. In this section, we present an erasure-coded storage system for edge computing. We first present an overview of our design. Then, we present some candidates for edge devices and study the performance of erasure coding on Jetson Nano under the impact of the parameter w and the packet size.

A. OVERVIEW OF OUR DESIGN
Fig. 4 presents an overview of our erasure-coded storage system for edge computing. The storage system built at the network edge has four components: the data producer, the data accesser, edge devices, and the storage cluster.

1) DATA PRODUCER
The data producer can be any of various kinds of mobile devices and sensors that generate data at the network edge: video cameras performing 24-hour monitoring, vehicles whose sensors constantly generate data, and mobile phones producing images and videos. These devices produce the data, but they have no computing power to spare for erasure coding tasks, because using them for erasure coding would greatly harm their service ability. For example, a mobile phone used for coding cannot give timely responses to user interaction, and some devices, such as cameras, lack the computing power to perform the computation themselves. As a result, edge devices will be employed to perform the encoding work when the data producer uploads its data to the storage cluster.

2) DATA ACCESSER
The data accesser is a device that retrieves data stored in the storage cluster; any device that can download data through the network can be a data accesser. When no data are missing, the data accesser can access the storage cluster directly to get the required data, which is called a Normal Read. However, when some storage servers in the storage cluster have failed, which existing studies show happens frequently, the decoding operation may be required before any missing data can be sent to the accesser; this is the case labeled Degrade Read in the figure. In this case, edge devices are needed to perform decoding operations for the data accesser.

3) EDGE DEVICES
Edge devices are computing devices deployed between the data producer, the data accesser and the storage cluster. Any computing task is offloaded to edge devices for execution. When the data producer uploads data to the storage cluster, it sends the data to an edge device to perform the encoding operation, and the edge device then stores the encoded data in the storage cluster. When decoding is required by the data accesser, an edge device retrieves the required blocks from the storage cluster, performs decoding, and sends the restored data to the data accesser. As edge devices are involved in both read and write operations of the erasure-coded storage system, their encoding and decoding performance is crucial to the whole storage system.

4) STORAGE CLUSTER
The storage cluster consists of many storage servers that store the data generated at the network edge. Traditionally, a storage cluster is deployed in data center and cloud computing platforms, but to reduce access latency and improve the network bandwidth between the data producer, the data accesser and the storage cluster, it must be deployed at the network edge. The storage cluster can be built by an Internet Service Provider or a cloud service provider that is building an edge cloud; it can also be a peer-to-peer storage system at the network edge.
Compared with erasure-coded storage systems for data center and cloud storage, the main difference of an erasure-coded storage system for edge computing is that it adopts edge devices to perform the encoding and decoding operations, as shown in Figure 4.
The data producer, the data accesser, edge devices, and the storage cluster communicate through the wireless network. The available state-of-the-art wireless network technologies are 5G and Wi-Fi 6, so the maximum communication throughput the storage system can achieve is the achievable bandwidth of 5G and Wi-Fi 6. When the throughput of the encoding and decoding operations on edge devices cannot match the bandwidth of 5G and Wi-Fi 6, erasure coding becomes a performance bottleneck of our erasure-coded storage system at the network edge. As a result, the performance of erasure coding on edge devices is crucial to the whole system and is worth a comprehensive study.

B. A STUDY ON EDGE DEVICES
Edge devices are deployed in the mobile environment to perform coding tasks assigned by the data producer and the data accesser. We identify two requirements for edge devices. First, they should have the computing power to perform any assigned task. Second, they should be deployable near mobile devices to reduce latency and improve network bandwidth. We list four candidates for edge devices in Table 2. They are hand-sized devices that can be easily deployed in the mobile environment. Their prices range from $39 to $399, with performance from less than 1 TFLOPS to 21 TFLOPS.
Taking computing power and price as the two factors for comparison, Jetson Xavier NX costs $19 per TFLOPS. It should be able to provide over 100 GB/s coding throughput, based on the results of G-CRS on other GPU devices with less computing power than Jetson Xavier NX. Since edge devices are required to be deployed near mobile devices, a city of around 1000 km² may need more than tens of thousands of edge devices, as an edge device may be placed every 10 to 100 m² to receive tasks from mobile devices. Because so many devices are needed at the network edge, Jetson Nano is in general the preferable edge device, further saving the cost of building out edge devices while providing the required computing power.
Here we use Jetson Nano as the edge device for studying the performance of erasure coding. As introduced in Section II-B2, the performance of erasure coding is mainly impacted by two factors: the packet size and w. We run a set of experiments to see the impact of w and the packet size on the performance of erasure coding. We use four sets of (k, m): (2, 1), (3, 2), (6, 3) and (10, 4). The (2, 1) and (3, 2) settings are the configurations for RAID 5 and RAID 6 [42], respectively; (6, 3) is the setting for QFS [43], and (10, 4) is the one used by Facebook's f4 storage system [7]. We set the block size to 1 MB. Each measurement is run ten times, and we compute the mean throughput and its variance. In our figures, the dot in each line is the mean throughput of a measurement, and the length of the vertical bar at each dot represents the standard deviation.
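The statistics we report can be reproduced with a few lines; this helper (ours, not part of either coding library) computes the mean and the sample variance of the ten throughput measurements of one configuration:

```c
#include <assert.h>
#include <stddef.h>

/* Mean and sample variance of n throughput measurements (e.g. in MB/s). */
static void mean_variance(const double *x, size_t n,
                          double *mean, double *var) {
    double sum = 0.0, sq = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    *mean = sum / (double)n;
    for (size_t i = 0; i < n; i++)
        sq += (x[i] - *mean) * (x[i] - *mean);
    *var = sq / (double)(n - 1);   /* sample (n - 1) normalization */
}
```

The standard deviation shown as the vertical bar in our figures is the square root of this variance.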

1) THE IMPACT OF PACKET SIZE ON CODING
As illustrated in Fig. 3, Jerasure accesses k × w packets from a column of k data blocks to generate one packet of a parity block, with w successive packets accessed from each data block. As a result, the packet size can have a great impact on the performance of erasure coding because of temporal and spatial locality on CPUs. Note that G-CRS uses many threads to make concurrent memory accesses, so its packet size is fixed at 8 bytes to obtain efficient data access. Thus, we only study the impact of the packet size on Jerasure. The packet size is set from 8 bytes to 8 KB, and we fix the value of w at 4. The results of our measurements on Jetson Nano are presented in Fig. 5. Based on the results, the packet size has a great impact on the performance of erasure coding on the CPU. For each set, the performance grows greatly as the packet size increases from 8 bytes to 1 KB; the throughput increases severalfold over this range for each set of k and m. In some cases there is a further slight performance increase as the packet size grows from 1 KB to 4 KB, while some cases encounter a slight performance decrease from 4 KB to 8 KB.
The case with k = 2 and m = 1 has the highest throughput. When k increases to 3 and m increases to 2, the highest throughput drops to less than one third of that of the case with k = 2 and m = 1. This reveals that the performance is greatly limited by the memory bandwidth and computing power of a single CPU core. Unsurprisingly, the set with k = 10 and m = 4 has the minimum throughput.
In summary, by setting a proper packet size for CRS coding, we can make good use of temporal and spatial locality on the CPU. Meanwhile, except for the set with k = 2 and m = 1, the throughput is less than 1000 MB/s for the other three sets. As a result, erasure coding may be a performance bottleneck on edge devices.

2) THE IMPACT OF w ON CODING
When studying the parameter w, we set the packet size for Jerasure to 1 KB, as it achieves good performance in our measurements. The value of w for Jerasure is 4, 8, 16 and 32, and the value of w for G-CRS ranges from 4 to 8. Fig. 6 presents the results for Jerasure. For the set (2, 1), the throughput can match the network bandwidth of 5G and Wi-Fi 6, and the value of w seems to have no impact on the encoding and decoding performance; the bandwidth between the CPU and main memory could be the limiting factor. However, for the other sets of values, (3, 2), (6, 3) and (10, 4), the throughput is less than 1000 MB/s. As k and m increase, the throughput decreases. Meanwhile, doubling the value of w reduces the throughput to nearly half.
From the results, we can see that higher values of k and m mean higher computing consumption and that the value of w has a great impact on the performance of erasure coding. Using the smallest value of w achieves the highest throughput. Thus, based on the mathematical property of CRS codes that 2^w must be greater than k + m, we can select the smallest value of w that matches this requirement to obtain good performance.
Fig. 7 demonstrates our measurements of G-CRS. The throughput of each set is quite different from that of Jerasure. In general, the set with k = 10 and m = 4 has the highest throughput. Meanwhile, the sets with (k, m) as (10, 4) and (6, 3) have the biggest variance. While the average throughput of G-CRS with k = 10 and m = 4 can exceed 1000 MB/s, the high variance indicates that its real performance may not meet the requirement of a stable throughput.
The change of the value of w has no big impact on the performance of G-CRS. The performance of G-CRS can be limited by two factors: one is the limited memory throughput between main memory and the GPU threads when they make concurrent accesses; the other is the limited computing power of the Jetson Nano.
In general, we can see that the change of the value of w has a big impact on the performance of Jerasure. There is no obvious impact on G-CRS, but the high variance means its performance is quite unstable on Jetson Nano.
We presented an erasure-coded storage system for edge computing in this section. Meanwhile, we studied the performance of erasure coding on the edge device Jetson Nano with two state-of-the-art erasure coding libraries: Jerasure and G-CRS. From our measurements, G-CRS can obtain 1000 MB/s throughput on average, but the high variance reveals that it cannot perform well on Jetson Nano. The throughput of Jerasure does not match the bandwidth of 5G and Wi-Fi 6 in most cases. To remove the bottleneck of erasure coding at the edge, we need to improve the performance of erasure coding.

IV. PARALLEL ERASURE CODING WITH OpenMP
Our study of the impact of packet size and w reveals that G-CRS suffers from high variance and Jerasure can hardly reach 1000 MB/s, which means neither is able to saturate the bandwidth of 5G and Wi-Fi 6. This makes using multi-core techniques to accelerate erasure coding a natural choice.
In this section, we present our design and implementation of parallel erasure coding on a multi-core CPU. We first present the design of our parallel solution. Then, we illustrate our portable, high-performance implementation using OpenMP.

A. THE DESIGN OF OUR PARALLEL SOLUTION
There are four CPU cores on the Jetson Nano. If we can fully utilize those CPU cores, we may achieve up to four times the performance of Jerasure running on a single CPU core. Namely, we use four threads, each making 100% utilization of one CPU core when performing erasure coding.
We introduce the job of erasure coding as follows. A file of H bytes is divided into k data blocks of H/k bytes each. As presented in Figure 3, the bitmatrix multiplies the k data blocks to generate m parity blocks. We use T threads to perform the coding task in parallel. When the value of T is 1, each packet of the m parity blocks is generated by multiplying one row of the bitmatrix with one column of the k data blocks. When T is bigger than one, we must divide the erasure coding task evenly to perform it in parallel.
One existing approach is to assign the generation of each parity block to the T threads in a round-robin fashion [38]. This may lack efficiency in some situations. For example, with two threads generating 3 parity blocks (m is 3), one thread is responsible for generating 2 parity blocks and the other generates only 1. While the first thread is generating its second parity block, the other thread stays idle. This indicates the available threads are not utilized efficiently.
Alternatively, we can assign the coding task of each packet to each thread in a round-robin fashion, as G-CRS does. In this method, all packets are assigned evenly to all threads. However, each thread then needs to calculate which packet it is responsible for coding. This introduces extra computation and becomes a performance penalty that decreases the performance of erasure coding.
To make better utilization of the available T threads while providing high performance, we need to assign the erasure coding task evenly with the least extra computation cost: each thread should be responsible for generating nearly the same number of packets without frequently calculating the location of the packets assigned to it. To achieve this goal, we further divide each data block into T equally-sized subblocks. Fig. 8 presents how we divide each block into T subblocks. The size of each subblock is H/(kT) bytes. As a result, each thread accesses the bitmatrix and k subblocks from the k data blocks to generate m subblocks for the m parity blocks. In this way, our design parallelizes the task of erasure coding efficiently.
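The partitioning above reduces to simple index arithmetic, which is what keeps the per-thread bookkeeping cheap. A minimal sketch (the function names are ours, not those of Listing 1): each block of H/k bytes is split into T equally-sized subblocks, and thread t works on the byte range [t*sub, (t+1)*sub) of every data and parity block.

```c
#include <stddef.h>

/* Illustrative sketch of the subblock partitioning described above.
 * block_size is H/k bytes; T is the number of threads. */
static size_t subblock_size(size_t block_size, int T)
{
    return block_size / (size_t)T;          /* H/(k*T) bytes per thread */
}

static size_t subblock_offset(size_t block_size, int T, int t)
{
    /* Start of the byte range thread t codes in every block. */
    return (size_t)t * subblock_size(block_size, T);
}
```

Because the offset is a single multiplication, a thread never has to recompute packet locations in a loop, unlike the per-packet round-robin scheme.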

B. THE IMPLEMENTATION OF OUR DESIGN
We face three challenges when accelerating erasure coding on a multi-core CPU. The first is how to obtain good performance by making good use of the available computing resources. The second is that our implementation should be portable, so that no extra effort is required to compile and execute it. The last is that our implementation must be consistent with the existing API; only then can our work be easily adopted by erasure-coded storage systems. Thus, we need to tackle these challenges when implementing our design.
A simple way to parallelize erasure coding is to issue a system call to create threads each time a task is performed in parallel. However, these system calls incur an extra performance penalty on every encoding and decoding operation, so this cannot achieve our desired performance. To avoid frequent thread creation, a better solution is to create the required number of threads at system start and have each thread wait for coding tasks. But the synchronization mechanism required in user-level code also incurs extra performance penalties when assigning a coding task.
To obtain good performance, we adopt the shared-memory programming paradigm OpenMP [16]. It creates a thread pool, maintained at the kernel level, whose threads can be reused once created. Each idle thread waits for a task using a spinlock. By using OpenMP, we can parallelize erasure coding with T threads easily and avoid the performance penalties of creating threads. As OpenMP is supported on virtually all platforms, our implementation has no platform-dependent issues.
We present our implementation with OpenMP in Listing 1. We use a data structure, presented from line 1 to line 3, to hold the locations of the k subblocks of the data blocks and the m subblocks of the parity blocks. We then implement a function named openmp_erasurecoding to parallelize the process of erasure coding with T threads. We use a macro definition to give the value of T. The implementation calls the function provided by OpenMP to set the number of threads to T when the program starts executing. The number of threads for parallel computing can be reset by calling that OpenMP function again; in that case, the value of T needs to be reset accordingly to keep the two in sync.
Our implementation performs erasure coding with only a for-loop (from line 14 to line 22 in Listing 1). Two inner for-loops (one from line 15 to line 17, another from line 18 to line 20) assign to each thread the subblocks of the data blocks and the subblocks of the parity blocks it must generate, as presented in Fig. 8. The implementation involves no system call to create T threads. The directive #pragma omp parallel for tells OpenMP to use T threads to execute the part from line 15 to line 21. Each iteration of the for-loop from line 14 to line 22 is executed by a single thread from the thread pool. Since there are no data or code dependencies between iterations, the T threads can execute in parallel. Inside the for-loop, we use the API from Jerasure to perform erasure coding without any modification (at line 21). Our implemented API openmp_erasurecoding is user-friendly: it takes the same input parameters as the one provided by Jerasure, which it invokes at line 21.
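The shape of this loop can be shown with a self-contained sketch. The outer `#pragma omp parallel for` over T subblocks mirrors the structure of Listing 1, but the Jerasure call at line 21 is replaced here by a plain XOR parity (the m = 1 case of bitmatrix coding) so the sketch compiles without the library; all function and variable names are ours, not the paper's.

```c
#include <string.h>
#include <stddef.h>

#define T 4  /* number of worker threads, set via a macro as in the paper */

/* Sketch of the parallel coding loop: each iteration codes one
 * subblock of every block, so the T iterations are independent and
 * OpenMP can run them on T threads from its pool. The real
 * implementation calls Jerasure here instead of XORing. */
static void openmp_xor_coding(int k, char **data, char *parity,
                              size_t blocksize)
{
    size_t sub = blocksize / T;              /* size of one subblock  */
    #pragma omp parallel for
    for (int t = 0; t < T; t++) {
        size_t off = (size_t)t * sub;        /* this thread's range   */
        memset(parity + off, 0, sub);
        for (int i = 0; i < k; i++)          /* XOR the k subblocks   */
            for (size_t b = 0; b < sub; b++)
                parity[off + b] ^= data[i][off + b];
    }
}
```

Note that if the code is compiled without OpenMP support, the pragma is simply ignored and the loop runs serially, which is part of what makes the OpenMP approach portable.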
Our parallel erasure coding solution is built on top of the Jerasure library, using its API to perform erasure coding. Note that since our implementation is written in the C programming language and is similar to our demonstration in Listing 1, it can be compiled and used on any CPU architecture and any platform that supports the OpenMP programming paradigm, without any modification.
Encoding and decoding are essentially the same operation from the perspective of using k equally-sized blocks to generate m equally-sized blocks with a bitmatrix. While Jerasure uses two separate functions for encoding and decoding, our single API in Listing 1 is suitable for both operations. When decoding is required, we only need to recalculate the bitmatrix for decoding and use the same API for parallel coding. Our API is consistent with that of Jerasure and can easily replace it in existing erasure-coded storage systems to obtain good performance.
In summary, we design and implement a portable solution for accelerating erasure coding on a multi-core CPU using the shared-memory programming paradigm OpenMP. Our implementation obtains good performance, and our API is consistent with Jerasure's so that it can be easily adopted by existing erasure-coded storage systems.

V. PERFORMANCE EVALUATION
We already presented an evaluation of the performance of erasure coding under the impact of w and packet size on the edge device Jetson Nano in Section III-B. Because both throughput and latency are important factors for erasure-coded storage systems at the network edge, we study the performance of erasure coding on Jetson Nano in terms of both throughput and latency in this section.
We first set the size of each block to 1 MB to evaluate the throughput. When measuring the latency, we set the size of each block in the range from 16 KB to 512 KB, because KB-level block sizes are a proper choice when measuring the access latency of storage systems. The erasure code adopted for measurement is the CRS code. We set the packet size to 1 KB for Jerasure and select the smallest valid value of w to achieve good performance. The number of threads is set to 2 and 4 when using OpenMP to parallelize erasure coding. Each measurement is executed ten times, and we calculate the mean value and variance. In the figures of this section, the dot on each line is the mean value of a measurement, and the length of the vertical bar through each dot represents the standard deviation.

A. THROUGHPUT OF ERASURE CODING
When measuring the throughput of erasure coding, we set the value of m to 1, 2, 4 and 8. The value of k ranges from m to 42. Figure 9 presents the encoding throughput and Figure 10 presents the decoding throughput.
We can see that the mean throughput of G-CRS outperforms the others in most cases; it can exceed 1000 MB/s in most cases. However, it suffers from high variance in all cases. Although G-CRS has the highest throughput, only when k is bigger than 15 does the lower end of its deviation range stay around 1000 MB/s in most cases. But the values of k in deployed data center and cloud storage systems are typically smaller than 15, as they use small values of k and m to obtain good system performance. This reveals that G-CRS may not obtain the desired performance for some configurations of k and m used by state-of-the-art erasure-coded storage systems in practice. For example, when k is 10 and m is smaller than 3, the lowest throughput may be several hundred MB/s, which is far from the bandwidth of the wireless networks 5G and Wi-Fi 6.
Due to the limited computing power of a single core of the Jetson Nano CPU, Jerasure has the lowest throughput in all cases. This also reveals that computation limits the throughput of erasure coding on the CPU. Unlike G-CRS, the throughput of Jerasure has a small variance, which means the encoding and decoding throughput is very stable when executing Jerasure on a single CPU core of the Jetson Nano.
When using OpenMP to accelerate erasure coding, we see an obvious performance gain from parallel erasure coding on the CPU. The throughput with two threads is nearly double that of Jerasure, and going from two threads to four threads still brings nearly 50% further growth. When the value of m is 1 or 2 and the value of k is equal to or smaller than 10, our acceleration with four threads achieves more than 1000 MB/s coding throughput. But when m is 4 or 8, the performance hardly reaches 1000 MB/s due to limited computing power. It also has a small variance, which means the performance is quite stable with our parallel solution. As erasure-coded storage systems mainly use small values of k and m in practice, our acceleration can obtain the required performance for some commonly used configurations.
Except for G-CRS, all codes suffer a big performance decrease when k+m is 17. This is because the value of w increases from 4 to 8 to satisfy the condition that 2^w must be equal to or greater than k+m. Since G-CRS exhibits high variance that may hide other performance penalties, the impact of w is not visible in its measurements.
Our evaluation demonstrates that although G-CRS has the highest average throughput in most cases, its throughput may be unable to satisfy the requirement of 5G and Wi-Fi 6 in some cases due to the high variance. Our parallel acceleration on the CPU with OpenMP can be a supplement to achieve more than 1000 MB/s coding throughput for some specific values of k and m. In a word, our acceleration of erasure coding on multi-core CPUs with OpenMP can be a proper candidate for erasure-coded storage systems.

B. LATENCY OF ERASURE CODING
We use the same four sets of (k,m) as in Section III-B, (2,1), (3,2), (6,3) and (10,4), to study the latency of erasure coding, as they are practical configurations for erasure-coded storage systems in the data center and the cloud. The block size is set from 16 KB to 512 KB. Note that when using four threads to parallelize erasure coding, since the value of w is 4 and the packet size is 1 KB, the smallest subblock is w multiplied by the packet size, i.e., 4 KB. This means the smallest block size that can be divided among four threads is 16 KB; we would have to reduce the packet size to use four threads on data smaller than 16 KB. Figure 11 and Figure 12 present the encoding latency and decoding latency, respectively. Except for some cases where the block size is 16 KB, our parallel acceleration with four threads on the CPU achieves the lowest latency in nearly all cases, with low variance. The version with two threads also has lower latency than Jerasure. This reveals that our acceleration can obtain low latency when performing erasure coding on small data.
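The minimum-block-size constraint above is a one-line calculation, sketched here for clarity (the function name is ours): with T threads, word size w and a given packet size, the smallest codable subblock is w times the packet size, so the smallest block that can be split across all T threads is T times that.

```c
#include <stddef.h>

/* Smallest block size (in bytes) divisible among T threads when each
 * thread must code whole subblocks of w * packetsize bytes. */
static size_t min_parallel_block(int T, int w, size_t packetsize)
{
    return (size_t)T * (size_t)w * packetsize;
}
```

With the paper's settings of four threads, w as 4 and a 1 KB packet size, this gives 4 * 4 * 1024 = 16384 bytes, i.e., the 16 KB floor noted above.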
Jerasure, which uses only a single core to perform encoding and decoding operations, has the highest latency in most cases. In very few cases, like the one in Figure 12(b), it also shows high variance; during our measurement, background tasks may occasionally cause contention for system resources. But overall, its performance is quite stable. This is also confirmed by our parallelization on the multi-core CPU on top of Jerasure, whose performance is quite stable in most cases.
The cases for G-CRS have smaller variance than the ones in Figure 9 and Figure 10. However, it has the highest latency in most cases for the set with k as 2 and m as 1. When both k and m increase, its latency drops below that of Jerasure, but remains close to that of our parallel implementation with two threads. Accelerating CRS coding on the GPU requires more data to obtain high parallelism; thus, G-CRS has no advantage when performing erasure coding on small data.
Making a horizontal comparison for G-CRS across Figure 9, Figure 10, Figure 11 and Figure 12, we can see that G-CRS has lower variance with small block sizes from 16 KB to 512 KB, but the variance is quite high when the block size is 1 MB. G-CRS makes better use of GPU resources by issuing more threads to execute concurrently: the number of threads is the block size divided by 8 bytes, as each thread is responsible for coding m packets. Since only 128 CUDA cores exist in the Jetson Nano, more threads mean performance penalties caused by context switches between warps, each of which contains 32 threads executing concurrently. This indicates that although G-CRS can obtain higher performance by increasing the block size, the larger number of threads can cause a high variance of performance for data-intensive and compute-intensive applications. Thus, a tradeoff exists between stable but lower performance and higher but more variable performance.
In summary, our acceleration on the multi-core CPU using OpenMP obtains good throughput for some specific sets of k and m, and it also achieves low latency for small data. As a result, it can be a supplement that provides good erasure coding performance for our erasure-coded storage system built for edge computing when G-CRS delivers unstable performance with high variance.

VI. CONCLUSION
In this paper, we present an erasure-coded storage system for edge computing. We make a comprehensive study of a key performance bottleneck of the presented system: erasure coding. Our study demonstrates that although G-CRS achieves the highest throughput in most cases, the high variance of its performance indicates it may not work well for some specific values of k and m. It also indicates that when porting algorithms and applications optimized for data center and cloud systems, a measurement of their performance is required to see whether they can fulfill the requirements at the edge.
Our supplementary solution uses OpenMP to parallelize erasure coding on a multi-core CPU, making the throughput of encoding and decoding operations reach the network bandwidth of 5G and Wi-Fi 6 and reducing the latency of erasure coding for small data at the network edge. Our work can serve as a reference for further optimization of erasure-coded storage systems for edge computing. We leave the implementation and study of the whole-system performance of our presented erasure-coded storage system as future work.