Cost-Effective Data Placement in Edge Storage Systems With Erasure Code

Edge computing, as a new computing paradigm, brings cloud computing's computing and storage capacities to the network edge to provide low-latency services to users. The networked edge servers in a specific area constitute <italic>edge storage systems</italic> (ESSs), where popular data can be stored to serve the users in the area. These novel ESSs raise many new opportunities as well as unprecedented challenges. Most existing studies of ESSs focus on the storage of data replicas in the system to ensure low data retrieval latency for users. However, replica-based edge storage strategies can easily incur high storage costs. It is not cost-effective to store massive replicas of large data, especially data that do not require real-time access at the edge, e.g., system upgrade files, popular app installation files, and videos in online games. It may not even be possible, given the constrained storage resources on edge servers. In this article, we make the first attempt to investigate the use of erasure codes for cost-effective data storage at the edge. The focus is to find the optimal strategy for placing coded data blocks on the edge servers in an ESS, aiming to minimize the storage cost while serving all the users in the system. We first model this novel <italic>Erasure Coding based Edge Data Placement</italic> (EC-EDP) problem as an integer linear programming problem and prove its <inline-formula><tex-math notation="LaTeX">$\mathcal {NP}$</tex-math><alternatives><mml:math><mml:mi mathvariant="script">NP</mml:mi></mml:math><inline-graphic xlink:href="jin-ieq1-3152849.gif"/></alternatives></inline-formula>-hardness. Then, we propose an optimal approach named EC-EDP-O based on integer programming, as well as an approximation algorithm named EC-EDP-V that efficiently handles large-scale EC-EDP scenarios, for which EC-EDP-O is computationally intractable. 
The extensive experimental results demonstrate that EC-EDP-O and EC-EDP-V save 68.58% in storage costs on average (and up to 81.16% in large-scale scenarios) compared with replica-based storage approaches.


INTRODUCTION
In recent years, the world has witnessed the explosive growth of smart devices and mobile users. It is predicted that by 2021 there will be 32 billion connected mobile devices, and the global data traffic will reach 19.5 ZB per year [1]. The transmission of such massive data incurs heavy network traffic and consumes excessive network resources, leading to network issues including service interruptions and high network latency. To tackle these challenges, edge computing has emerged as a new computing paradigm. It moves computing and storage resources from the cloud onto edge servers at the network edge [2], [3]. Edge computing offers two key advantages to various online applications. First, users' data retrieval latency can be significantly reduced because they can retrieve data from their nearby edge servers rather than from the cloud. From app vendors' perspective, this ensures their users' quality of experience (QoE), because latency has become the key performance concern for online applications [4]. Second, service interruptions caused by network congestion can be alleviated by reducing the network traffic over the backhaul network [5]. This benefits app vendors by reducing the costs incurred by transmitting their data from the remote cloud to edge servers [6], [7].
In the edge computing environment, adjacent edge servers in an area are connected by high-speed links [8] to form an edge server network that constitutes an edge storage system (ESS) [3], [6], [9]. Compared with the edge-cloud architecture, an ESS overcomes the single-point-of-failure and performance bottleneck problems [6], [10]. New challenges raised by ESSs have started to attract researchers' attention in recent years; they attempt to achieve various optimization objectives by caching data replicas on edge servers, e.g., minimum data retrieval latency [11], maximum cache hit ratio [12], [13], maximum caching benefits [14], [15], and maximum caching capacity [16], [17]. A common assumption made by these replica-based approaches is that storage resources on each edge server in an ESS can always be hired on demand or reserved in advance for caching data replicas. However, this assumption is not always realistic in a real-world edge computing environment, where edge servers' storage resources are limited by the constrained physical sizes of base stations [18]. Even if it is feasible, caching massive data replicas on dense edge servers in an ESS is often not cost-effective because the storage resources on edge servers are expensive [19]. This issue is especially critical when app vendors need to store large data that do not require real-time access, e.g., system upgrade files, popular app installation files, and videos in online games, in ESSs mainly to save on the expenses incurred by transmitting data out of the cloud for every user. A new approach is needed to enable the cost-effective storage of such large data in ESSs.
In this paper, we study the use of erasure coding for cost-effective storage of large data in ESSs. Under an erasure code scheme, data X to be stored is divided into M data blocks, from which K parity blocks are encoded. These data and parity blocks are distributed to be stored on different storage nodes (e.g., edge servers in an ESS) accessible to users. A user can retrieve any M data and/or parity blocks (together referred to as coded blocks hereafter) from accessible edge servers to construct X for use. Erasure codes have been widely employed to reduce storage costs in cloud-based storage systems [20], [21]. However, the unique characteristics that fundamentally differentiate ESSs from cloud-based storage systems render existing approaches obsolete and raise a number of new challenges. First, in the edge computing environment, the coverage of an edge server is limited. A user can only access coded blocks from edge servers that cover the user. This is the proximity constraint [22], [23]. A storage approach based on erasure codes (referred to as an EC-based storage approach hereafter) must ensure that every user in the area can retrieve enough coded blocks to construct data X. This is the encoding constraint. In addition, data can be transmitted across edge servers over the edge server network topology to be delivered to users, but only within a limited number of transmission network hops [8], [18], [24]. Thus, unlike in traditional cloud storage systems, coded data blocks cannot be stored on arbitrary storage nodes in ESSs. This is the transmission constraint [6], [23], [25].
From the perspective of app vendors, the coded blocks of data stored in an ESS must be able to serve all the users at minimum storage cost while fulfilling the proximity, encoding, and transmission constraints. This problem is referred to as the erasure coding based edge data placement (EC-EDP) problem. This paper makes the first attempt to study this new problem, and the key contributions include:
1) We formally model the EC-EDP problem as an integer linear programming problem and prove that it is NP-hard.
2) We propose EC-EDP-O, an approach for finding optimal solutions to small-scale EC-EDP problems based on integer programming solvers.
3) We propose EC-EDP-V, an approximation approach used to find approximate solutions to large-scale EC-EDP problems with a ln(Q_{h_limit}) + 1 approximation ratio guarantee.
4) We evaluate the performance of EC-EDP-O and EC-EDP-V against five representative approaches through extensive experiments conducted on a widely-used EUA dataset.

MOTIVATING EXAMPLE
Since Windows 10, Microsoft has employed peer-to-peer distribution, in addition to traditional client-server distribution, to deliver large upgrade packages to its clients to reduce the network resource consumption incurred. In the meantime, app vendors like Microsoft can significantly reduce the costs of distributing such data to their clients by storing it in the ESSs facilitated at the network edge - Amazon Web Services charges up to US$0.11 to transfer 1GB of data out of its S3 data storage facilities to the internet (https://aws.amazon.com/s3/pricing/). Fig. 1 presents an example of an ESS comprising ten networked edge servers collectively serving the users in a specific area, e.g., the New York CBD. A straightforward replica-based solution to the distribution of Microsoft's 1GB Windows 10 upgrade package is to store a replica on each of the ten edge servers. In the edge computing environment, a user can access the edge servers that cover the user. The distance between the user and the edge server may impact its data rate, as considered in some studies, but not the latency between them. Thus, the latencies between users and edge servers are not considered in the formulation of EC-EDP strategies in this study. In this way, all the users in the New York CBD can download the package from their nearby edge servers. However, there are two critical limitations to this solution. First, it costs Microsoft tremendously to store 10 data replicas (10GB in total) in the ESS over a long time due to the expensive storage resources on the edge servers [19], [23], [26]. Second, it does not take advantage of the collaboration of edge servers, which can transmit data to each other to deliver data to users [8], [18], [24]. Take the ESS presented in Fig. 1 for example, and assume that it allows data to be transmitted via two hops over the topology of the edge server network. In real-world EC-EDP scenarios, the transmission latency between edge servers could differ. 
To generalize the models and approaches presented in this paper, we measure the transmission constraint by the number of hops over the edge server network; it could also easily be measured in milliseconds, similar to [6], [27]. With this edge storage system, Microsoft only needs to store two replicas of its Windows 10 upgrade package (2GB in total) in the system to serve all the users in the area, e.g., on v_4 and v_9, as illustrated in Fig. 2a, or on v_5 and v_8.
To further reduce the cost of storing data in this ESS, Microsoft can first encode the upgrade package into 2 data blocks (0.5GB each) and 1 parity block (0.5GB) through erasure coding, to be placed on 3 of the edge servers in the ESS, e.g., v_4, v_5, and v_8, as illustrated in Fig. 2b. Under the erasure coding scheme, a user can retrieve any 2 of the three coded blocks from edge servers within two network hops to construct the upgrade package. For example, the user in Fig. 2b can retrieve a data block from v_4 and the parity block from v_8 to construct the upgrade package. In this way, Microsoft only needs to store a total of 1.5GB of data in the ESS to satisfy all the users' access requests in the area, much less than the above replica-based approach.
An alternative solution is to encode the package into 3+1 coded blocks (0.33GB each), to be placed at v_4, v_5, v_8, and v_9, as illustrated in Fig. 2c. This solution requires 1.33GB of storage resources in total. Fig. 2d illustrates a third solution that encodes the package into 4 coded blocks, requiring a total of 2GB of storage resources. Among the three solutions shown in Fig. 2, the EC(3, 1) solution presented in Fig. 2c incurs the least storage cost. Compared with replica-based storage solutions, EC-based solutions are more flexible because they occupy less storage on individual edge servers. This is a critical advantage in the edge computing environment, where the storage resources of edge servers are highly constrained and expensive [19], [23], [26].
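The storage costs of the candidate solutions above follow directly from the EC(M, K) cost formula: an EC(M, K) scheme stores M + K blocks of size size(X)/M each. A minimal arithmetic sketch (the function name `storage_cost_gb` is ours; Fig. 2d is read here as EC(2, 2), since its four coded blocks total 2GB):

```python
def storage_cost_gb(size_x_gb, m, k):
    """Total storage footprint of an EC(M, K) scheme:
    (M + K) coded blocks, each of size size(X)/M."""
    return (m + k) * size_x_gb / m

# Candidate solutions for the 1GB upgrade package from Fig. 2:
replica_cost = 2 * 1.0                # two full replicas (Fig. 2a): 2GB
ec21 = storage_cost_gb(1.0, 2, 1)     # EC(2, 1), Fig. 2b: 1.5GB
ec31 = storage_cost_gb(1.0, 3, 1)     # EC(3, 1), Fig. 2c: ~1.33GB
ec22 = storage_cost_gb(1.0, 2, 2)     # EC(2, 2), Fig. 2d: 2GB
```

As the comparison shows, EC(3, 1) is the cheapest of the three EC options, matching the discussion of Fig. 2c.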
Given an ESS, there are usually a large number of feasible EC-EDP solutions combining different data encoding and placement strategies, and these solutions incur different storage costs. In real-world EC-EDP scenarios, the number of edge servers can be much larger and the network topology more complex, which makes finding EC-EDP solutions challenging. It is therefore important for app vendors to find the optimal solution that serves all the users in the ESS at minimum storage cost. Please note that EC-based approaches incur computational overheads for users, i.e., the time taken to construct data from coded blocks [28], [29]. Thus, EC-based approaches are most suitable for storing large data that do not require real-time access but consume a large amount of network bandwidth, e.g., system upgrade files, popular app installation files, and videos in online games.

PRELIMINARIES
Erasure coding is widely used in distributed storage systems to achieve low storage overhead and high reliability, e.g., in Microsoft's Azure [20] and Facebook's F4 [30]. By applying erasure coding, a piece of data is divided into M data blocks, from which K parity blocks are encoded. The M + K coded blocks in total are distributed to be stored on M + K nodes, and the data can be constructed from any M of the M + K coded blocks [31]. Fig. 3 presents an example where EC(3, 2) erasure coding (M = 3, K = 2) is employed to encode data X. Data X is divided into three data blocks f_1, f_2, and f_3, which are encoded into two parity blocks f'_1 and f'_2. The five coded blocks can be distributed to be stored on different edge servers. To construct data X, a user needs to retrieve at least three of the five coded blocks in the ESS. The encoding principle of an erasure code is to multiply the data by a coding matrix, and decoding is realized with matrix inversion [28]. In practice, to ensure that the results of the multiplication remain within a fixed size, such as one byte, the matrix multiplication in an erasure code is performed over a finite field [29].
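To make the encode/decode idea concrete, here is a minimal sketch of the K = 1 special case, where the single parity block is the bitwise XOR of the data blocks and the data can be rebuilt from any M of the M + 1 coded blocks. The function names are ours; real schemes such as the EC(3, 2) example above use Reed-Solomon codes over a finite field rather than plain XOR:

```python
from functools import reduce

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, m: int):
    """Split data into m equal data blocks plus one XOR parity block."""
    assert len(data) % m == 0, "coded blocks must be of equal size"
    size = len(data) // m
    blocks = [data[i * size:(i + 1) * size] for i in range(m)]
    return blocks + [reduce(_xor, blocks)]   # indices 0..m-1 data, m parity

def decode(available: dict, m: int) -> bytes:
    """Reconstruct the data from any m of the m+1 coded blocks.
    `available` maps block index to block bytes."""
    assert len(available) >= m, "need at least m coded blocks"
    if all(i in available for i in range(m)):
        return b"".join(available[i] for i in range(m))
    # One data block is missing: XOR of the m available blocks
    # (parity included) recovers it, since parity = d_0 ^ ... ^ d_{m-1}.
    missing = next(i for i in range(m) if i not in available)
    blocks = dict(available)
    blocks[missing] = reduce(_xor, available.values())
    return b"".join(blocks[i] for i in range(m))
```

For example, encoding 6 bytes with m = 3 yields three 2-byte data blocks and one parity block, and dropping any single block still allows reconstruction.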
In this research, we study the most general EC-EDP scenarios, where at most one coded block can be stored on each edge server in the ESS. This storage limit generalizes the number of coded blocks that can be stored on each edge server. Allowing multiple coded blocks to be stored on each edge server would make it easier to find a storage solution but would lower the reliability of the data stored in the system. Take an extreme case for example, where all M + K coded blocks are stored on only one of the edge servers in the ESS to serve all the users. If that edge server fails, the data becomes unavailable to all the users. On the contrary, if only one coded block can be stored on each edge server, the failure of a single edge server does not significantly lower the reliability of the data; in fact, the ESS may still be able to serve all the users as long as they can still retrieve M coded blocks. The storage limit also generalizes our EC-EDP approaches (to be presented in Section 5) by relaxing the need for app vendors to reserve a large amount of storage resources on individual edge servers.

MODEL AND PROBLEM FORMULATION
In this section, we first formulate the EC-EDP problem and then prove its NP-hardness by reduction from a classic NP-hard problem. The main notations used throughout this paper and their definitions can be found in Table 1.

Problem Formulation
Similar to [6], the S networked edge servers in an ESS can be modeled as an undirected graph G(V, E), where each edge server v_i ∈ V corresponds to a vertex, and the connection between two edge servers corresponds to an edge. In the edge computing environment, encoding multiple data items together for storage in an edge storage system is not cost-effective. For example, if five data items are encoded as a bundle into a number of coded blocks, a user requesting just one of them still has to retrieve enough coded blocks to construct all five. Transmitting these coded blocks consumes extra network resources, and it also takes extra time for the user to construct the data from the coded blocks. Thus, each data item is encoded individually in the context of this study.
Given a coded block f_t divided from X, a block placement decision, denoted by r_{i,t} ∈ {0, 1}, indicates whether block f_t is placed on edge server v_i. Let q_i denote the number of coded blocks placed on server v_i, calculated as follows:

q_i = Σ_{t=1}^{M+K} r_{i,t}.

Constraint (3) enforces the storage limit, i.e., at most one coded block can be stored on any edge server:

q_i ≤ 1, ∀ v_i ∈ V. (3)
Let h_limit represent the transmission constraint introduced in Sections 1 and 2. Let a_{i,j} ∈ {0, 1} indicate whether user u_j can access (i.e., is covered by) server v_i, and let b_j denote the number of edge servers that user u_j can access without violating the transmission constraint:

b_j = |{v_w ∈ V : h_{w,j} ≤ h_limit}|,

where h_{w,j} is the distance (measured in hops) between user u_j and edge server v_w.
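The quantity b_j can be computed with a multi-source breadth-first search over G, starting from the servers that cover u_j and stopping at h_limit hops. A minimal sketch, assuming the edge server network is given as an adjacency list (the helper names `hops_within` and `b_of_user` are ours):

```python
from collections import deque

def hops_within(adj, sources, h_limit):
    """Multi-source BFS: return {server: hop distance} for every server
    reachable within h_limit hops of any server in `sources`."""
    dist = {s: 0 for s in sources}
    q = deque(sources)
    while q:
        v = q.popleft()
        if dist[v] < h_limit:            # stop expanding at the hop limit
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
    return dist

def b_of_user(adj, covering_servers, h_limit):
    """b_j: number of edge servers user u_j can reach within h_limit hops,
    starting from the servers that cover u_j (the set with a_{i,j} = 1)."""
    return len(hops_within(adj, covering_servers, h_limit))
```

For example, on a chain of servers 0-1-2-3, a user covered only by server 0 with h_limit = 2 can reach servers {0, 1, 2}, so b_j = 3.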
To ensure that each user can retrieve adequate coded blocks for constructing data X, M, i.e., the number of data blocks divided from X, must not exceed the minimum number of edge servers accessible to any user u_j ∈ U within h_limit hops over G. Since an erasure coding scheme divides X into at least 2 data blocks, there is

2 ≤ M ≤ min_{u_j ∈ U} b_j. (6)

According to the encoding constraint, a user covered by more than one edge server can access any of them. Take Fig. 1 for example: u_2 can directly access only edge servers v_4 and v_5. Let d_{i,j} denote the minimum distance (in hops over G) from user u_j ∈ U to server v_i ∈ V:

d_{i,j} = min_{v_w ∈ V : a_{w,j} = 1} h_{w,i}.

Let b_{i,j} ∈ {0, 1} indicate whether user u_j can retrieve a coded block from server v_i under the transmission constraint, i.e., b_{i,j} = 1 if a coded block is placed on v_i and d_{i,j} ≤ h_limit. To ensure that each u_j ∈ U can retrieve adequate coded blocks for constructing X, there must be at least M edge servers with a coded block within h_limit hops over the edge server network:

Σ_{i=1}^{S} b_{i,j} ≥ M, ∀ u_j ∈ U, (9)

where Σ_{i=1}^{S} b_{i,j} is the number of coded blocks that user u_j can retrieve within h_limit network hops. Please note that variable b_{i,j} is defined to ensure the feasibility of the data placement strategy by allowing users to retrieve the necessary coded blocks for constructing data X under the transmission constraint, which differs from the definition of variable a_{i,j}.
The optimization objective of the EC-EDP problem is to minimize the total storage cost, which is calculated based on the number and the size of the coded blocks stored in the ESS. Since matrix multiplication and inversion are involved in every erasure coding scheme, all coded blocks must be of equal size. Thus, the size of each coded block f_t ∈ F is size(f_t) = size(X)/M. Let N ≜ M + K. Based on the storage limit, there is

N = Σ_{v_i ∈ V} q_i ≤ S.

The total storage cost incurred by an EC-EDP strategy is N · (size(X)/M). Given that size(X) is a constant specific to X, the cost incurred by an EC-EDP strategy R can be presented as follows:

cost(R) = N / M. (11)

Thus, the optimization objective of EC-EDP, i.e., to minimize the storage cost incurred, can be expressed as follows:

min cost(R) (12)
s.t.: (3), (6), (9).
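Putting constraints (3), (9), and objective (11) together, the feasibility and cost of a candidate placement can be checked directly. A minimal sketch, assuming an adjacency-list graph and per-user coverage sets; the function name `feasible_and_cost` is ours, and the storage limit (3) is enforced by representing the placement as a set of servers holding one block each:

```python
from collections import deque

def feasible_and_cost(adj, users_cov, placement, m, h_limit):
    """Return cost(R) = N/M from Eq. (11) if `placement` (a set of servers,
    one coded block each) satisfies constraint (9) for every user, else None.
    `users_cov` lists, for each user, the servers covering that user."""
    n_blocks = len(placement)                  # N = M + K, one block per server
    for cov in users_cov:
        # BFS within h_limit hops from the user's covering servers
        dist = {s: 0 for s in cov}
        q = deque(cov)
        while q:
            v = q.popleft()
            if dist[v] < h_limit:
                for w in adj[v]:
                    if w not in dist:
                        dist[w] = dist[v] + 1
                        q.append(w)
        if sum(1 for s in placement if s in dist) < m:   # constraint (9)
            return None
    return n_blocks / m                        # cost(R) = N / M
```

On a 5-server chain with users covered by servers 0, 2, and 4, placing blocks on {0, 2, 4} with M = 2 and h_limit = 2 is feasible at cost 3/2, while placing on {1, 3} leaves the outer users short of M blocks.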

Problem Hardness
In this section, we prove the NP-hardness of EC-EDP by reduction from a classic NP-hard problem, i.e., the minimum dominating set (MDS) problem [32].
Given an undirected graph G = (V, E), let N[v_i] denote the vertex set composed of vertex v_i and its adjacent vertexes in G. With decision variables y_i ∈ {0, 1} indicating whether vertex v_i is selected, the MDS problem can be expressed as:

min Σ_{i ∈ V} y_i
s.t.: Σ_{j ∈ N[i]} y_j ≥ 1, ∀ i ∈ V, (13a)
y_i ∈ {0, 1}, ∀ i ∈ V. (13b)

The reduction from the MDS problem to the EC-EDP problem can be done as follows: 1) let the number of data blocks M be a deterministic value; 2) let every user access a fixed edge server. Given an undirected graph G = (V, E) in the MDS problem, we can find an instance of the MDS problem MDS(V', E, ws), where ws = Σ_{i ∈ V} y_i. We can also construct an instance of the EDP problem EDP(V*, E*, cs) with the reduction above, where |V*| = |V'|, |E*| = |E|, and cs = Σ_{j ∈ U, i ∈ V} b_{i,j}. Then, constraint (9) can be converted to Σ_{j ∈ Θ(i)} y_j ≥ M, where Θ(i) represents the vertex set comprised of vertex v_i and the vertexes within h_limit hops over G; with M = 1 and h_limit = 1, this is exactly constraint (13a). According to (3), at most one coded block can be placed on each edge server, so constraint (13b) is fulfilled. Hence, any solution to this EC-EDP instance corresponds to a solution to the MDS instance. Thus, the EC-EDP problem is NP-hard.

APPROACH DESIGN
In this section, we first model the EC-EDP problem as an integer linear programming problem. Then, we propose two approaches, EC-EDP-O and EC-EDP-V. EC-EDP-O solves small-scale EC-EDP scenarios optimally based on integer programming, while EC-EDP-V solves large-scale EC-EDP scenarios with a ln(Q_{h_limit}) + 1 approximation ratio guarantee.

Optimal Approach
The EC-EDP problem can be modeled as an integer linear program (ILP). Given an edge server network G = (V, E), where V = {v_1, ..., v_S} and E = {e_1, ..., e_P}, let us define a set of variables Y = {y_1, ..., y_S} to represent an EC-EDP strategy, where y_i ∈ {0, 1}. If y_i = 1, a coded block is placed on edge server v_i; otherwise, y_i = 0. The ILP model for the EC-EDP problem minimizes cost(R) = (Σ_{i=1}^{S} y_i)/M subject to constraints (3), (16), and (17). Constraint (16) guarantees that the users covered by edge server v_i can only retrieve coded blocks within the transmission limit. Constraint (17) is converted from (9) to guarantee that every user can retrieve adequate coded blocks to construct data X.
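Since the paper solves this ILP with a commercial solver (CPLEX, see Section 7), the following pure-Python sketch illustrates what EC-EDP-O computes by exhaustively enumerating block counts M and placements on a toy instance; it is only viable at small scale, and all names are ours:

```python
from itertools import combinations
from collections import deque

def reachable(adj, sources, h_limit):
    """Servers within h_limit hops of any server in `sources` (BFS)."""
    dist = {s: 0 for s in sources}
    q = deque(sources)
    while q:
        v = q.popleft()
        if dist[v] < h_limit:
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
    return set(dist)

def ec_edp_exhaustive(adj, users_cov, h_limit):
    """Return (cost, M, placement) minimizing N/M such that every user has
    at least M placed blocks within h_limit hops; an illustrative stand-in
    for the solver-based EC-EDP-O, exponential in the number of servers."""
    servers = list(adj)
    reach = [reachable(adj, cov, h_limit) for cov in users_cov]
    b_min = min(len(r) for r in reach)        # upper bound on M, cf. (6)
    best = None
    for m in range(2, b_min + 1):
        found = False
        for n_total in range(m, len(servers) + 1):
            for placed in combinations(servers, n_total):
                ps = set(placed)
                if all(len(ps & r) >= m for r in reach):   # constraint (9)
                    cost = n_total / m                      # Eq. (11)
                    if best is None or cost < best[0]:
                        best = (cost, m, ps)
                    found = True
                    break          # smallest feasible N for this M found
            if found:
                break
    return best
```

On the 5-server chain used earlier, the optimum is M = 2 with three placed blocks, i.e., cost 1.5.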

Approximation Approach
As proven in Section 4.2, the EC-EDP problem is NP-hard. EC-EDP-O can solve small-scale EC-EDP problems optimally, but it is intractable in large-scale EC-EDP scenarios. To address the complexity of solving large-scale EC-EDP scenarios efficiently, we propose an approximation approach named EC-EDP-V. Algorithm 1 presents the pseudocode of EC-EDP-V, and Fig. 4 illustrates its approximation process for finding the solution to the EC-EDP problem presented in Fig. 2.
The key idea of EC-EDP-V is to select the edge servers that produce the maximum benefits from storing coded blocks in the edge storage system. We design a voting mechanism where EC-EDP-V adjusts the vote weight of each edge server, and each edge server iteratively votes for the edge servers within its h_limit hops. In each iteration, the edge server with the highest number of votes is selected; if multiple edge servers tie for the highest number of votes, EC-EDP-V randomly selects one of them and updates the vote weight of each edge server. The algorithm starts with an initial S* = ∅, which stores the current best solution to the EC-EDP problem (Line 2). Note that A_i (i = 1, ..., n) on Line 3 is the set of edge servers within h_limit hops from v_i, and A*_i on Line 4 is the number of v_i's neighbor edge servers within the h_limit-hop limit. To find the final solution, the algorithm iterates n times, once for each candidate number of data blocks M, producing n candidate EC-EDP solutions (Lines 7-21). In each iteration, it initializes the number of coded blocks needed by each edge server m_i = M, the current storage cost C_M = 0, and the set of selected candidate edge servers S*_M = ∅ (Lines 7-9). Take the ESS presented in Fig. 4a for example, and assume that data can be transmitted via two network hops. Edge servers v_1, v_2, ..., v_10 are initialized with the same vote weight of 3, i.e., the number of coded blocks needed by each edge server. Then, the algorithm loops over Lines 10-21. In each iteration of the loop, it assigns m_i as the vote weight w_i of each edge server v_i ∈ V without a coded block (Line 11). Next, all the edge servers within h_limit hops from v_i vote for v_i with their vote weights w_j (Lines 12-14). As shown in Fig. 4b, edge servers v_4 and v_5 each receive 27 votes from their neighbor edge servers within 2 hops, i.e., 3 votes from each of v_1, ..., v_9 and 3 votes from each of v_1, ..., v_6 and v_8, ..., v_10, respectively. In this example, v_4 is chosen over v_5. After that, all the edge servers in V are sorted by their numbers of votes, and the algorithm selects the one (v_k) with the most votes to be included in S*_M, i.e., the set of candidate edge servers (Lines 15-17). Next, for each edge server v_r ∈ A_k, the number of its required coded blocks m_r decreases by 1 (Lines 18-20). In this way, as shown in Fig. 4b, when edge server v_4 is chosen, its vote weight m_4 decreases to 0. Let us now look at Fig. 4c, where v_4 and v_5 have been chosen for their highest votes. The vote weights of their neighbor edge servers within 2 hops, including v_1, v_2, v_3, v_6, v_8, v_9, and v_10, decrease by 1. Next, the algorithm compares the current storage cost C_M against the candidate EC-EDP solutions found so far; if it is lower than the current lowest storage cost, the corresponding solution S*_M replaces the current best solution (Lines 22-25). As shown in Fig. 4d, the final solution contains v_4, v_5, v_8, and v_9, achieving the lowest storage cost ratio of 1.33.
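The inner voting loop described above can be sketched for a fixed M as follows. This is a simplified reading of Algorithm 1 (names are ours): demand is tracked per edge server rather than per user, ties are broken deterministically rather than randomly, and the outer loop over candidate values of M is left out:

```python
from collections import deque

def ec_edp_voting(adj, h_limit, m):
    """One voting pass for a fixed M: repeatedly place a coded block on the
    server with the most votes until every server's remaining demand m_i
    reaches 0. EC-EDP-V would run this per candidate M and keep the
    cheapest feasible placement."""
    def neigh(src):  # A_i: servers within h_limit hops of src (incl. src)
        dist = {src: 0}
        q = deque([src])
        while q:
            v = q.popleft()
            if dist[v] < h_limit:
                for w in adj[v]:
                    if w not in dist:
                        dist[w] = dist[v] + 1
                        q.append(w)
        return set(dist)

    A = {v: neigh(v) for v in adj}
    assert all(len(A[v]) >= m for v in adj), "M must satisfy constraint (6)"
    demand = {v: m for v in adj}        # m_i: blocks still needed near v_i
    placed = set()
    while any(d > 0 for d in demand.values()):
        # every server votes, with weight m_i, for unplaced servers in A_i
        votes = {v: 0 for v in adj if v not in placed}
        for v in adj:
            for u in A[v]:
                if u not in placed:
                    votes[u] += demand[v]
        winner = max(votes, key=votes.get)
        placed.add(winner)
        for v in A[winner]:             # winner now serves its neighborhood
            demand[v] = max(0, demand[v] - 1)
    return placed
```

On the 5-server chain with h_limit = 2 and M = 2, the pass places blocks on three servers, matching the cost-3/2 optimum found earlier for that instance.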

Theoretical Analysis
In this section, we theoretically analyze the approximation ratio and time complexity of the proposed approach EC-EDP-V.

Approximation Ratio
Given an edge server network G = (V, E), let N_{h_limit}(v_i) denote the set of edge server v_i's neighbor edge servers within h_limit hops, Q_{h_limit}(G) denote the maximum size of N_{h_limit}(v_i) over all v_i, OPT denote the optimal solution to the EC-EDP problem, and OPT* denote the solution found by EC-EDP-V. For each edge server over the network topology, the number of its neighbor edge servers within h_limit network hops is less than Q_{h_limit} + 1. When an edge server with the most votes is selected, we have the following inequality:

n ≤ (Q_{h_limit} + 1) + Q_{h_limit} · (|OPT| − 1). (18)

From (18), we can infer |OPT| ≥ (n − 1)/Q_{h_limit}. Let us assume that the number of remaining coded blocks to be placed after the i-th iteration in Algorithm 1 is c_i, with c_0 = n. Considering the i-th iteration, the optimal solution can cover the remaining c_i − 1 blocks, so the edge server selected by Algorithm 1 in the i-th iteration covers at least ⌈(c_i − 1)/|OPT|⌉ of them. Now, we can infer:

c_{i+1} ≤ c_i − ⌈(c_i − 1)/|OPT|⌉. (19)

By induction, (20) can be proven based on (19); the details of the proof are omitted here.
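The omitted induction behind (20) can be sketched as follows, using only the recurrence (19):

```latex
% From (19), dropping the ceiling: c_{i+1} \le c_i - (c_i - 1)/|OPT|.
\begin{aligned}
c_{i+1} - 1 &\le (c_i - 1)\Bigl(1 - \tfrac{1}{|OPT|}\Bigr), \\
c_i - 1 &\le (c_0 - 1)\Bigl(1 - \tfrac{1}{|OPT|}\Bigr)^{i}
         \le (c_0 - 1)\, e^{-i/|OPT|}.
\end{aligned}
```

Setting i = |OPT| · ln((c_0 − 1)/|OPT|) leaves at most |OPT| + 1 blocks unplaced; since (18) implies (c_0 − 1)/|OPT| ≤ Q_{h_limit}, the total number of selections is at most |OPT| · (ln Q_{h_limit} + 1).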
After the i-th selection is made, the number of remaining coded blocks to be placed satisfies c_i − 1 ≤ (c_0 − 1) · e^{−i/|OPT|}. This shows that after the (|OPT| · ln((c_0 − 1)/|OPT|))-th iteration, the number of remaining coded blocks does not exceed |OPT| + 1. Let us assume that the iterative process ends after selecting c_f ≤ |OPT| + 1 more edge servers. The total number of selected edge servers then fulfills:

|OPT*| ≤ |OPT| · ln((c_0 − 1)/|OPT|) + c_f ≤ |OPT| · (ln Q_{h_limit} + 1).

Therefore, the approximation ratio of the EC-EDP-V algorithm is ln(Q_{h_limit}) + 1.

Time Complexity
Consider an EC-EDP problem with n edge servers V = {v_1, v_2, ..., v_n} in a geographic area, and for each edge server v_i ∈ V, let p denote the average number of its neighbor edge servers within h_limit hops. We first analyze the time complexity of Lines 11-14 of Algorithm 1: the voting process takes O(n) time because all n edge servers vote. Maintaining the vote order and selecting the edge server with the most votes on Line 15 takes at most O(log n) time, e.g., with a priority queue. The upper limit of M bounds the number of inner iterations (Lines 10-21), which is determined by min_i |A_i|. When min_i |A*_i| ≥ p, the complexity of the overall process in the worst-case EC-EDP scenario is no more than O((n − p) log n) per candidate solution. After the inner iterations (Lines 10-21), Algorithm 1 has obtained a total of n − 2 candidate solutions. Therefore, the time complexity of Algorithm 1 is O(n² log n).

EVALUATION
In this section, we conduct extensive experiments to evaluate the performance of EC-EDP-O and EC-EDP-V in different EC-EDP scenarios.

Dataset
In order to evaluate the competing approaches realistically, we conduct the experiments on the real-world EUA dataset (https://github.com/swinedge/eua-dataset). This dataset contains 1,464 real-world edge servers and 131,312 users in Metropolitan Melbourne, Australia.

Competing Approaches
Five representative approaches are implemented in Java 8 to be compared against EC-EDP-O and EC-EDP-V:

Greedy Degrees (GD): This EC-based approach repeatedly selects the edge server with the highest degree to place a coded block, until all the users are covered, i.e., they can all retrieve adequate coded blocks within h_limit hops.

Random Block Placement (RBP): This EC-based approach randomly selects one edge server at a time to place a coded block, one after another, until all the users are covered.
LGEDC [33]: This replica-based approach heuristically places replicas of data X to minimize the storage cost while covering all the users within h_limit hops.

GRED [11]: This replica-based approach tries to spread replicas of data X across the edge servers. Specifically, it first heuristically selects n_{h_limit+1} candidate edge servers that can serve all the users with minimum data retrieval latency within h_limit hops. Then, from these candidates, it selects those connected to the fewest other candidate edge servers within h_limit hops, until all the users are covered.

TMC18 [34]: This replica-based approach partitions edge servers into multiple groups with the Lagrangian method based on the number of user requests for data X received by individual edge servers. Then, it always places replicas of data X in the group with the lowest overall number of replicas of data X, until all the users are covered.

In the implementation of EC-EDP-O, IBM's CPLEX Optimizer is employed to solve the ILP problem optimally.

Experiment Setup
Two scales of experiments are conducted. Set #1 is conducted within the Melbourne CBD area to evaluate the performance of EC-EDP-O and EC-EDP-V in small-scale EC-EDP scenarios. Set #2 is conducted within Metropolitan Melbourne to evaluate EC-EDP-V in large-scale EC-EDP scenarios. To facilitate comprehensive evaluations, we simulate different EC-EDP scenarios by varying three setting parameters, as summarized in Table 2.

Number of edge servers (n = |V|): the size of the edge server network G, increasing from 10 to 35 in Set #1 and from 50 to 250 in Set #2.

Density of edge servers (d = |E|/|V|): the density of the edge server network G, increasing from 1 to 2.5 in Set #1 and from 2.0 to 5.0 in Set #2.

Hop limit (h_limit): the parameter enforcing the transmission constraint, increasing from 1 to 5 in both Set #1 and Set #2.

Performance Metrics
Two metrics are employed for performance evaluation:

Storage cost (cost): the ability of an approach to achieve the optimization objective of the EC-EDP problem, calculated by Eq. (11); the lower, the better.

Computational overhead (time): the efficiency of an approach, measured by the CPU computation time; the lower, the better.

Experimental Results
In this section, we comprehensively present and analyze the experimental results in Set #1 and Set #2.

Experiment Set #1
Effectiveness. Fig. 5 illustrates the storage costs incurred by the seven approaches and the impacts of the three parameters in Set #1. We can clearly see the significant advantages of EC-based approaches over replica-based approaches in minimizing storage costs. Among the four EC-based approaches, EC-EDP-O and EC-EDP-V are the clear winners in all the cases. This illustrates the importance of leveraging erasure coding to cost-effectively utilize the constrained and expensive storage resources in ESSs. EC-EDP-O achieves the lowest storage cost in all the cases. Compared with EC-EDP-O, EC-EDP-V incurs about 3.94% more storage cost on average in Set #1. Meanwhile, EC-EDP-V incurs much lower storage costs than GD, RBP, LGEDC, TMC18, and GRED, by 23.99%, 36.19%, 56.28%, 53.47%, and 58.29%, respectively. Fig. 5a shows the impact of the number of edge servers n on the storage cost in Set #1.1. The storage costs incurred by the approaches increase when n increases. The storage costs incurred by LGEDC and RBP increase at higher rates than the other five approaches. When n increases, the scale of the EC-EDP problem increases. Accordingly, replica-based approaches need to place more data replicas to serve all the users, and EC-based approaches also need to place more coded blocks. However, the total size of these extra coded blocks is much smaller than that of the extra data replicas placed by LGEDC, TMC18, and GRED. Among the seven approaches, EC-EDP-O always achieves the lowest storage costs, 4.01% lower than EC-EDP-V, 27.97% lower than GD, 33.35% lower than RBP, 59.43% lower than LGEDC, 53.30% lower than TMC18, and 60.06% lower than GRED on average. Fig. 5b demonstrates the impact of the edge server density d on storage costs in Set #1.2. When d increases, the storage costs incurred by all seven approaches decrease. 
The root cause is that a larger d connects each edge server to more adjacent edge servers within h_limit hops. Fewer coded blocks or data replicas need to be stored in the ESS to cover all the users. This immediately results in a decrease in the total storage cost incurred and indicates the importance of leveraging the collaboration of edge servers. In Set #1.2, EC-EDP-V outperforms GD, RBP, LGEDC, TMC18, and GRED by an average of 24.47%, 38.05%, 59.26%, 64.49%, and 70.98%, respectively. We can see that EC-EDP-O always achieves the lowest storage cost, 3.35% lower than EC-EDP-V on average. Fig. 5c shows the impact of the hop limit h_limit in Set #1.3. When h_limit increases, coded blocks or data replicas can travel via more hops to be delivered to the users, and the total storage costs incurred by the approaches decrease accordingly. When h_limit varies from 1 to 5, EC-EDP-V outperforms GD, RBP, LGEDC, TMC18, and GRED by an average of 24.34%, 41.21%, 53.85%, 53.47%, and 58.29%, respectively. EC-EDP-O, again, achieves the lowest storage costs in all the cases, outperforming EC-EDP-V by 5.31% on average.
Efficiency. In Fig. 6, we can clearly see that EC-EDP-O incurs the highest computational overhead in the entire set of experiments. This is expected and confirms the high computational complexity of solving the EC-EDP problem optimally. The computational overheads of the other approaches are close to 0, but not 0. Specifically, the computation times of GD, RBP, LGEDC, TMC18, and GRED are all under 5 milliseconds in Set #1. As a result, Fig. 6 does not illustrate EC-EDP-V's efficiency clearly. In the next section, we exclude EC-EDP-O and discuss the performance differences between EC-EDP-V and GD, RBP, LGEDC, TMC18, and GRED clearly in the large-scale EC-EDP scenarios of Set #2.

Experiment Set #2
Effectiveness. Fig. 7 demonstrates the advantages of the EC-EDP-V approach in minimizing storage costs in large-scale EC-EDP scenarios. It always achieves the lowest storage cost in all the cases in Set #2. Specifically, the storage cost achieved by EC-EDP-V is 55.63%, 65.70%, 79.01%, 81.06%, and 83.52% lower than those of GD, RBP, LGEDC, TMC18, and GRED, respectively.
As demonstrated in Fig. 7a, when n increases, the storage costs incurred by the EC-EDP strategies formulated by the approaches increase linearly. We can see EC-EDP-V's significant advantages over the other five approaches, i.e., 69.10%, 74.80%, 83.33%, 78.05%, and 80.24% over GD, RBP, LGEDC, TMC18, and GRED on average. The reason behind this is similar to that in Set #1 and thus is not repeated here. It is worth mentioning that when n reaches 250, the storage cost achieved by EC-EDP-V is only 20.23%, 19.56%, and 18.23% of what is achieved by LGEDC, TMC18, and GRED, respectively. These are considerable storage cost savings and clearly show EC-EDP-V's prominent advantage over LGEDC, TMC18, and GRED in storing data cost-effectively in large-scale ESSs.
As illustrated in Figs. 7b and 7c, the impacts of the increases in d and h_limit on storage cost in Set #2 are similar to what we observed in Set #1. Specifically, EC-EDP-V can save an average of 48.89% storage cost against GD, 61.15% against RBP, 76.58% against LGEDC, 82.86% against TMC18, and 85.61% against GRED. The underlying reasons are also similar to those in Set #1 and thus are not discussed in detail here.
Efficiency. Fig. 8 shows the computation times incurred by all the approaches in Set #2. EC-EDP-V always takes more time than the other five approaches to find a solution, 170.71%, 205.59%, 241.88%, 233.26%, and 243.12% more than GD, RBP, LGEDC, TMC18, and GRED, respectively. Fig. 8 also shows that the computational overheads of all the approaches increase gradually with n, d, and h_limit. Overall, EC-EDP-V scales well with n, d, and h_limit, taking no more than 150 milliseconds. Its significant advantages in minimizing storage costs over GD, RBP, LGEDC, TMC18, and GRED, illustrated in Fig. 7, make this extra computation time tolerable. When n, d, or h_limit increases, the number of possible EC-EDP solutions that can fulfil all the constraints in the ILP model presented in Section 5.1 increases. According to Algorithm 1, it then takes EC-EDP-V more iterations (Lines 10-21) to process the votes for each edge server, and thus more time to complete.
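The vote-processing behavior described above can be sketched as follows. This is a minimal, illustrative reconstruction rather than the paper's Algorithm 1: the function and variable names are assumptions, placement is simplified to at most one coded block per chosen server, and ties are broken arbitrarily. It does illustrate why more servers, higher density, or a larger hop limit mean more votes to tally per iteration.

```python
from collections import deque

def reachable(adj, src, h_limit):
    """Edge servers within h_limit hops of src (BFS), src included."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        v, h = frontier.popleft()
        if h == h_limit:
            continue
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                frontier.append((u, h + 1))
    return seen

def vote_based_placement(adj, k, h_limit):
    """Greedy voting sketch: until every server can reach at least k
    coded blocks within h_limit hops, each under-served server votes
    for the candidate hosts it can reach, and one coded block is
    placed on the most-voted host."""
    cover = {v: reachable(adj, v, h_limit) for v in adj}
    placed = set()
    while True:
        unserved = [v for v in adj if len(cover[v] & placed) < k]
        if not unserved:
            return placed  # every server can decode the data
        votes = {}
        for v in unserved:
            for host in sorted(cover[v] - placed):  # sorted for determinism
                votes[host] = votes.get(host, 0) + 1
        placed.add(max(votes, key=votes.get))  # most-voted host wins
```

On a four-server path topology 0-1-2-3 with k = 2 and h_limit = 2, this sketch places blocks on servers 1 and 2, after which every server reaches two blocks within two hops.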

Conclusion
The experimental results show that EC-EDP-O is a clear winner in small-scale EC-EDP scenarios, while EC-EDP-V is the best option for solving large-scale EC-EDP problems. They collectively offer a package for formulating cost-effective EC-EDP strategies in various ESSs.

RELATED WORK
The edge computing paradigm enables data caching at the network edge by facilitating edge storage systems (ESSs) in close geographic proximity to users. ESSs offer various novel opportunities and also raise many new challenges. They have attracted widespread attention in recent years [11], [14], [34].
Existing studies of ESSs are performed from the perspective of edge infrastructure providers, e.g., Amazon and Verizon, aiming to achieve various optimization objectives by storing or caching data and data replicas on the edge servers in an ESS. To name a few, Xie et al. [11] propose GRED, an efficient edge data placement algorithm that aims to balance the data retrieval workloads across the entire ESS and shorten the path for delivering data to users. Zhang et al. [34] explore data placement in ESSs to minimize overall data retrieval latency based on network topology, traffic distribution, and data popularity. Ren et al. [14] propose a cooperative edge data caching framework for ESSs that sets up cooperative caching regions to minimize data caching density and to promote data retrieval at the edge instead of from the remote cloud.
To accommodate data traffic at the network edge cost-effectively, network coding can be employed to split data into small blocks that are encoded for high data reliability and low storage occupation. Kim et al. [35] propose a coding framework that employs error-correcting data encoding and computation decoding to enable high data reliability in the edge computing environment. Wu et al. [36] introduce network coding into the mobile ad hoc network environment to minimize the energy required to transmit data between nodes. They model the physical broadcast links as a graph and construct a minimum-energy multicast tree as the optimal routing mechanism. Bulut et al. [37] study the erasure code based data routing problem in mobile networks and focus on parameter selection for reducing the cost of message delivery. Xu et al. [38] propose a game theory based approach to jointly optimize the content service satisfaction degree and network throughput in edge caching systems by deploying network coding for data routing. However, these studies adopt the same assumption made for cloud storage systems, i.e., that the storage nodes are fully and directly reachable from each other over high-speed links. This assumption is unrealistic in the edge computing environment, where the topology of the edge server network must be properly considered.
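The storage saving that erasure coding offers over replication can be illustrated with a minimal sketch. For simplicity it uses a single XOR parity block, i.e., a (k, 1) code tolerating the loss of any one block; production systems typically use (k, m) Reed-Solomon codes, but the storage arithmetic is analogous: k + m blocks of size data/k versus r full replicas.

```python
def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-size byte blocks."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

def encode(data, k):
    """Split `data` into k equal blocks (len(data) must be divisible
    by k) and append one XOR parity block. The resulting (k, 1) code
    survives the loss of any single block while storing only
    (k + 1)/k times the data size -- versus r times for r replicas."""
    size = len(data) // k
    blocks = [data[i * size:(i + 1) * size] for i in range(k)]
    return blocks + [xor_blocks(blocks)]

def recover(blocks, lost):
    """Rebuild the block at index `lost` by XOR-ing the survivors."""
    return xor_blocks([b for i, b in enumerate(blocks) if i != lost])

# A (4, 1) code stores 1.25x the data size; 3 replicas store 3x.
coded = encode(b"abcdefgh", 4)
assert recover(coded, 1) == b"cd"  # lost block reconstructed
```

Each edge server only needs to reach any k coded blocks to decode the data, which is what lets an ESS cover all users with far less total storage than full replicas.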
In very recent years, researchers also start to investigate the use of ESSs from the perspective of app vendor. For example, Cao et al. [15] propose an auction-based approach for edge cache space allocation, aiming to maximize app vendor's caching benefits while guaranteeing the quality of services of different users. Xia et al. [23] propose CEDC-O, an online edge data caching algorithm, which aims to minimize app vendors' caching cost plus the data migration cost based on Lyapunov optimization. They also investigate the problem of cost-effective edge data distribution from the cloud to ESSs for app vendors [6].
It is widely acknowledged in these studies that the storage resources on edge servers are constrained and expensive [6], [39]. The competition among app vendors makes it hard and often impossible for them to hire or reserve adequate resources for storing large data. Thus, storing an app vendor's multiple data replicas in an ESS to serve users covered by different edge servers in the ESS will cost the app vendor dearly. It is in fact too expensive and too resource-demanding to be practical. Existing studies of ESSs accommodate app vendors' need for low service latency by leveraging the ability of ESSs to minimize data retrieval latency for users. There is a lack of effort in helping app vendors store large data in ESSs cost-effectively. In this paper, we innovatively employ erasure coding to tackle this particular challenge. The key idea is to encode data into a number of coded blocks to be placed on the edge servers in an ESS so that all the users in the ESS can be served at minimum storage cost. This problem is referred to as the EC-EDP problem in this paper.
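In simplified form, and with notation assumed here rather than taken from the paper's full model in Section 5.1, the key idea above can be written as a covering-style ILP:

```latex
\begin{align}
\min \quad & \sum_{v \in V} c_v \, s_b \, x_v \\
\text{s.t.} \quad & \sum_{u \in N_h(v)} x_u \ge k, \quad \forall v \in V \\
& x_v \in \{0, 1\}, \quad \forall v \in V
\end{align}
```

where $x_v$ indicates whether a coded block is placed on edge server $v$, $s_b$ is the coded block size, $c_v$ is the unit storage cost on $v$, $N_h(v)$ is the set of edge servers within $h_{limit}$ hops of $v$ (including $v$ itself), and $k$ is the number of coded blocks required to decode the data. The covering constraint hints at the connection to set cover, consistent with the problem's NP-hardness.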

CONCLUSION AND FUTURE WORK
In this paper, we employ erasure coding to tackle the new EC-EDP problem of storing large data cost-effectively in an edge storage system, aiming to serve all the users in the system for app vendors at minimum storage cost. We first introduced, motivated, and formulated the EC-EDP problem. Then, we proposed two approaches, one for solving small-scale EC-EDP problems optimally and the other for finding approximate solutions with a provable performance guarantee in large-scale EC-EDP scenarios. The extensive experimental results indicate that by leveraging erasure coding and the ability of edge servers to cooperate, our approaches can formulate cost-effective EC-EDP strategies efficiently.
This study establishes the foundation for further study of the EC-EDP problem. In the future, we will study the tradeoff between data reliability and storage cost in EC-based data storage at the edge.
Hai Jin (Fellow, IEEE) received the PhD degree in computer engineering from the Huazhong University of Science and Technology, China, in 1994. He is currently a chair professor of computer science and engineering with the Huazhong University of Science and Technology. He was with The University of Hong Kong between 1998 and 2000, and a visiting scholar with the University of Southern California between 1999 and 2000. He has coauthored more than 20 books and authored or coauthored more than 900 research papers. His research interests include computer architecture, parallel and distributed computing, Big Data processing, data storage, and system security. In 1996, he was awarded the German Academic Exchange Service Fellowship to visit the Technical University of Chemnitz, Germany.

Zilai Zeng is currently working toward the undergraduate degree with the Huazhong University of Science and Technology, China. His research interests include edge computing, parallel and distributed computing, service computing, and cloud computing.
Xiaoyu Xia received the master's degree from The University of Melbourne, Australia, in 2015. He is currently working toward the PhD degree with Deakin University. His research interests include edge computing, parallel and distributed computing, service computing, software engineering, and cloud computing.