The Doctrine of MEAN: Realizing Deduplication Storage at Unreliable Edge

Placing popular data at the network edge helps reduce the retrieval latency, but it also brings challenges to the limited edge storage space. Currently, using available yet not necessarily reliable edge resources is common practice for edge space expansion, while deploying deduplication storage strategies is a general method for better space utilization. However, a contradiction arises when jointly implementing data deduplication with unreliable edge resources. On the one hand, the deduplication policy stipulates that any data chunk can be stored exactly once; on the other hand, the use of unreliable resources imposes that data should be backed up for the sake of file availability. To resolve such a contradiction, we propose MEAN, a deduplication-enabled storage system using unreliable resources at the network edge. The core idea of MEAN is to place similar files together for better deduplication and maintain replicas of popular files for higher reliability. We first formulate this problem and prove its NP-hardness, then provide efficient heuristics based on similarity-aware hierarchical clustering. Three different reliability scenarios are comprehensively considered to develop our algorithms. We also implement a prototype system and evaluate the performance of MEAN with a real-world dataset. The results show that MEAN can fortify the file hit ratio under unreliable environments by 77% while reducing the file retrieval delay by up to 71%, compared with the state-of-the-art approach.


A key metric for such a system is the hit ratio, which quantifies the percentage of data requests that can be served at the network edge. Consequently, popular data should preferentially be stored there.
Currently, the edge cluster suffers from limited storage space and cannot cope with the explosive growth of data. It is estimated that the worldwide number of IoT-connected devices will reach 43 billion by 2023 [7], and 75% of data is projected to be created and processed outside the cloud by 2025 [8]. While this can be solved by renting more proprietary resources, it is not always a cost-friendly option for service providers. Therefore, many studies [3], [9], [10], [11] suggest expanding edge storage space by incorporating various available edge resources, even though some resources may be unreliable. They range from the idle resources provided by various enterprises and individuals to the resources reserved for other applications that are not fully utilized yet. This gives content providers a cost-efficient and immediate way to expand their storage space. By storing more files with the extended space, this methodology can improve the hit ratio to some extent. The downside, nevertheless, is that many of these resources are often unreliable. Some edge servers may erase the stored content or leave the storage system at any time. Therefore, redundancy should be generated to guarantee file availability.
As a key technology to achieve space efficiency, data deduplication has been adopted by many modern storage systems [12], [13], [14], [15], [16]. A common practice for data deduplication is to split files into multiple fixed/variable-size chunks, and only one copy of each chunk is maintained [17], [18]. This methodology explores similarities between files, allowing the storage cluster to retain only unique data. In recent years, there have also been many efforts to implement deduplication for edge storage systems [2], [19]. It is reported that the redundancy of IoT and multimedia data can be reduced by over 70% with data deduplication [2], [14], [20]. In addition, according to the study conducted by Microsoft [21], the typical space saving is around 50-60% in general file share scenarios, while datasets with high duplication could see optimization rates of up to 95%, or a 20× reduction in storage utilization. Since more files can be stored with limited storage resources, the hit ratio is improved to some extent.
In production storage systems, the above methodologies should be combined so that the space can be extended freely and occupied with deduplicated chunks. However, it should be noted that these two methodologies may present some contradictions. On the one hand, the deduplication policy stipulates that any data chunk can be stored exactly once; on the other hand, the use of unreliable resources imposes that data should be backed up for the sake of file availability. For the extended space, it is not uncommon that a stored chunk becomes unavailable due to hardware failures, software crashes, or recycling of the space by its responsible application. As a consequence, all the files that share that chunk become incomplete and unavailable. Therefore, a crucial question here is: how should we use the unreliable space at the network edge, to store more deduplicated chunks of more files or to back up chunks in case of failures?
To resolve the above dilemma, we present MEAN, a deduplication-enabled edge storage system using unreliable resources. Aiming at a high hit ratio, MEAN adopts neither extreme policy (backing up all chunks or deduplicating all chunks) but takes a middle course (replicating a portion of the chunks and deduplicating the rest). Specifically, MEAN selects the stored files with the joint consideration of file popularity, file similarity, and server reliability. MEAN improves the availability of popular files through replication while keeping the extra space cost low, and the remaining files tend to be deduplicated for space efficiency.
We first formulate this problem and prove its NP-hardness. Thereafter, to reduce the search space in similarity detection, we propose a similarity-aware hierarchical clustering algorithm. Based on this algorithm, we carefully design a set of heuristic algorithms to determine which files to store and where to place their chunks or replicas. The algorithms are progressively generalized according to three different reliability scenarios. The core insight is to dynamically compare the hit ratio gains of adding replicas of already-stored files with those of directly storing a new file. In this way, MEAN provides a trade-off between file availability and space efficiency, thus improving the file hit ratio under limited storage space.
Fig. 1. Eight files (f1 ∼ f8), with popularities (9, 2, 1, 4, 1, 3, 2, 1), are partitioned into 12 chunks (c1 ∼ c12). The storage resources are composed of two edge servers (ES1 and ES2), each with a storage size of five chunks. The reliability of ES1 and ES2 is 0.7 and 0.6, respectively. The aim is to maximize the hit ratio when storing a part of the files at the edge.

The examples of MEAN and its comparisons are illustrated in Fig. 1. We assume that the cloud is a conventional deduplication storage system utilizing either inline or offline deduplication. The cloud partitions files into chunks and selects a subset of frequently accessed files to be placed at the edge. The popularity of each file is predetermined, indicating its expected access frequency in the future. For simplicity, we assume that all chunks are of equal size in this example. The deduplication-aware scheme (a) [2] stores the most popular files and allocates their deduplicated chunks across the edge servers evenly. Such a method can improve the file hit ratio. Nevertheless, it is hard to guarantee the availability of the stored files (the expected file hit ratio is only around 40.7%). We find that production distributed storage systems [22], [23], [24] typically employ replication for fault tolerance. Additionally, some deduplicated storage systems [25] also consider incorporating a fixed number of replicas for unique chunks to ensure their availability. Inspired by this, one possible improvement is to add replicas for each popular file, as shown in Fig. 1(b). Nevertheless, the expected file hit ratio (around 49.7%) is only slightly improved due to the decreased number of stored files. Scheme (c) (i.e., MEAN), by contrast, is a relatively superior solution with a maximum file hit ratio of around 66.2%. It eliminates a part of the redundancies to free up storage space, while the most popular file f1 is replicated to enhance file availability. Thus, it achieves an elegant trade-off between space efficiency and file availability. The strengths of MEAN lie in its ability to select stored files dynamically and to decide the location and number of replicas for each stored chunk, considering space efficiency and file availability jointly. The major contributions can be summarized as follows.
• We are the first to consider the problem of implementing deduplication-enabled storage with unreliable edge resources. We propose MEAN to realize a high file hit ratio by reaching a balance between replication and deduplication of data chunks.
• We formulate the data deduplication problem at the unreliable edge and prove its NP-hardness. Efficient heuristic algorithms are designed to generate a feasible solution, based on similarity-aware hierarchical clustering.
• We implement a prototype system of MEAN and evaluate the performance under realistic environments with a real-world dataset. The results show that MEAN can fortify the file hit ratio under unreliable environments by 77%, while reducing the retrieval delay by up to 71%.

The rest of this paper is organized as follows. Section II states the related work and motivation. Section III presents the problem formulation and the hardness analysis. Section IV exhibits the heuristic algorithms for the three heterogeneous edge storage scenarios. Section V reports our experimental results, and finally, Section VI concludes this paper. The algorithm of MEAN is available at https://github.com/JX-Xia/MEAN.

II. RELATED WORK AND MOTIVATION
In this section, we first introduce the related work and then present the motivation of MEAN.

A. Related Work
With the explosive growth of digital data, deduplication [17], [18] has attracted increasing attention in large-scale storage systems to realize the space efficiency rationale. The typical chunk-level deduplication process is to split files or data streams into fixed-size [26] or variable-size [17], [27] chunks and then calculate their fingerprints (e.g., MD5 or SHA256). Only chunks with unique fingerprints are stored, while duplicate chunks are eliminated. Such an approach can effectively reduce data redundancy and free up a large amount of storage space. Studies conducted by Microsoft [28], [29] and EMC [30], [31] indicate that about 50% and 85% of the data in production primary and secondary storage systems, respectively, can be removed using deduplication technology. In the general file share scenarios, the typical space savings can be up to 50-60% after deduplication [21].
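To make the chunk-level workflow concrete, the following Python sketch performs fixed-size chunking with SHA-256 fingerprints and keeps exactly one copy of each chunk; the chunk size, function names, and in-memory chunk store are illustrative assumptions rather than the pipeline of any cited system.

    import hashlib

    CHUNK_SIZE = 4096  # illustrative fixed chunk size in bytes

    def deduplicate(paths):
        """Store each unique chunk once; return per-file recipes and the chunk store."""
        chunk_store = {}   # fingerprint -> chunk bytes, kept exactly once
        recipes = {}       # file path -> ordered list of chunk fingerprints (the file recipe)
        for path in paths:
            recipe = []
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(CHUNK_SIZE)
                    if not chunk:
                        break
                    fp = hashlib.sha256(chunk).hexdigest()
                    chunk_store.setdefault(fp, chunk)  # duplicate chunks are eliminated here
                    recipe.append(fp)
            recipes[path] = recipe
        return recipes, chunk_store

A file is reconstructed by concatenating the chunks named in its recipe, which mirrors the file-recipe metadata discussed in Section III.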
Due to this competitive advantage, the deduplication technology has also been explored for deployment at the network edge in recent years to address space efficiency. It is reported that more than 70% of the redundancy in IoT and multimedia data can be eliminated by data deduplication [2], [14], [20]. Li et al. [19] present a collaborative edge-facilitated deduplication technique to balance the deduplication ratio and the deduplication throughput. Luo et al. [32] propose a graph-based approach to maximize the data deduplication ratio with delay constraints. Cheng et al. [33] design a lightweight three-layer hash mapping method to allocate the most similar files into one edge server for better redundancy elimination. However, these efforts do not take into account the file popularity and thus cannot achieve a higher hit ratio with limited edge storage space. In contrast, HotDedup [2] models the file similarities as a δ-similarity graph, and then allocates files with higher popularity at the edge network. In this way, HotDedup allows more popular files to be stored and thus improves the file hit ratio. Since HotDedup is closely related to the work in this paper, we mainly compare our MEAN method with it. Nevertheless, HotDedup is primarily concerned with space efficiency, and its edge storage servers are always assumed to be reliable.
Expanding storage space is another way to make the edge hold more files. Due to the diversity of edge resources, many studies propose to expand edge storage space by using various available resources. For example, Pu et al. [10] advocate edge storage in cloud radio access networks to facilitate mobile multimedia services. Liu et al. [9] propose a cost-efficient edge storage system using embedded storage nodes. Various idle resources and reserved resources are further emphasized in the literature [3] to achieve cost-effective space expansion. However, these methods are currently not incorporated with data deduplication technologies. Thus, the scarce edge resources cannot be fully utilized because of the duplicated chunks, especially for stored files with high similarity. In addition, the diversity of resources makes server reliability a challenge. The storage system should deal with unreliable and dynamic resources to ensure file availability. Unlike the existing strategies, our proposed MEAN is pathbreaking in highlighting both space efficiency and file availability in edge storage. With an elegant trade-off between these two rationales, the file hit ratio can be maximized, and more data retrieval requests can be served at the nearby edge with less service latency.

B. Motivation
While space efficiency and file availability can both effectively improve the file hit ratio, the combination of the two rationales invokes intractable challenges. As illustrated by the example in Fig. 1, data deduplication assists space efficiency, but magnifies the negative impact of data failure in unreliable edge environments. To explore the relationship between the two, we run tests to compare the hit ratio under different server reliabilities and redundancies. For simplicity, we consider the scenario of microservice deployment at the edge, where users or edge servers pull code images from repositories to deploy various container-based microservices. Due to real-time service requirements, these microservices can be short-term and dynamically activated and deactivated, making data retrieval frequent [34]. We download some popular repositories from the GitHub website [35], as well as some of their most downloaded versions, to conduct the test. They are split into chunks using the variable-size chunking policy [17], which has been widely demonstrated to be more efficient than the fixed-size chunking method [28], [29]. Then, we set up 10 Virtual Machines (VMs) to act as edge storage servers to maintain these repositories. Replication is adopted for fault tolerance, since it is widely proven to have better read/write performance [36], [37].
In each round of experiments, these VMs are randomly shut down according to their reliability. We generate 1,000 retrieval requests on a new VM; a request counts as a hit if the required file can be retrieved from these VMs, and as a miss otherwise. The results are based on an average of 100 rounds of experiments. Fig. 2(a) exhibits the hit ratios under different reliability of VMs. The first comparison method is the deduplicated storage without replicas, denoted as NR, where the unique chunks are randomly distributed across 10 VMs. The second is the 3-replica method, denoted as 3R, where each deduplicated chunk maintains 3 replicas across different VMs for fault tolerance. This setting is also implemented by many production storage systems [22], [23], [24] as the default fault tolerance mode. As shown in Fig. 2(a), the hit ratio of NR experiences a rapid decline from 100% to only 7.56% as the server reliability decreases from 1.0 to 0.8. By contrast, the hit ratio of the 3R method still remains at a high level (74.80%, to be specific) when the reliability drops to 0.8. Thus, redundancy can have a positive impact on the hit ratio to some extent.
However, if we supplement the chunk replicas arbitrarily, the extra space occupied by the redundancies would crowd out the originally stored content, which is not conducive to increasing the hit ratio. Therefore, we further conduct tests to observe the impact of the number of replicas. The results are shown in Fig. 2(b). In this set of tests, the total storage space of the 10 VMs is set to 40% of the dataset size, and the reliability of each VM is fixed at 0.8. The file hit ratio grows initially when one more replica is added for each unique chunk. However, excessive redundancy unnecessarily consumes a significant amount of storage space, thereby reducing the capacity of the cluster to hold more popular files. As a result, the file hit ratio declines from 28.07% to only 13.42% when the number of chunk replicas increases from 2 to 5.
From the above test results, we conclude that server unreliability can have a significant negative impact on the hit ratio of deduplication-enabled storage, while replication is a double-edged sword. Therefore, the trade-off between deduplication and replication should be delicately balanced when implementing deduplication-enabled storage with unreliable storage resources. This problem is intractable, and its complexity is multiplied when the heterogeneity of file popularity and server reliability is further considered. In addition, although some studies [25], [38] have emphasized the importance of fault tolerance in deduplication storage systems and utilized hash-based mechanisms such as CRUSH or DHT algorithms to determine file or chunk placement, they fail to take into account factors such as file popularity and heterogeneous server reliability, which are crucial for striking an elegant balance between deduplication and replication. For example, to enhance the hit ratio, we prioritize adding more replicas to highly popular files to strengthen their fault tolerance. We also conduct deduplication for the space-saving purpose, which helps to accommodate more files at the resource-limited edge. The combination of the two achieves an elegant balance between redundancy and deduplication, which promotes the file hit ratio. However, such optimizations have no corresponding operations in the hash-based methods. Therefore, they cannot be directly employed in our scenarios to enhance the hit ratio for edge storage.
To this end, this paper presents MEAN, a deduplication-enabled storage system at the unreliable edge. MEAN leverages replication to enhance file availability, while deduplication is also employed to eliminate unnecessary redundancies for space efficiency. As far as we know, it is the first work to enhance the hit ratio with joint consideration of both space efficiency and file availability for edge storage. To achieve this, MEAN supplements the chunk replicas according to the popularity of different files and their data-sharing dependencies with the existing stored content. The chunk locations are also considered to further promote file availability under heterogeneous server reliabilities.

III. PROBLEM FORMULATION
In this section, we first formulate the deduplication storage problem in Section III-A. With the problem being defined, we analyze and prove the problem hardness in Section III-B.

A. Problem Formulation
We assume that there is a collection of files F = {f_1, f_2, . . .} that are frequently fetched by users. Their popularities are estimated through statistical analysis within a specific time frame, denoted as H = {h_1, h_2, . . .}. A higher level of popularity indicates that the file will be accessed more frequently for a period in the future. Many studies have extensively explored methods to predict the popularity values of files [39], and all of these methods can be applied to our work. Since this is not the main concern of this paper, we assume that the popularities are known in advance. Let C = {c_1, c_2, . . .} be the set of unique chunks (after data deduplication) that are partitioned from the files in F. The Boolean variable x_{i,j} indicates an inclusion relation, where x_{i,j} = 1 means that chunk c_j is included in file f_i. A subset of the files in F is stored at the edge to facilitate data requests and reduce retrieval delays. This constitutes a two-tier storage architecture: the cloud data center keeps the whole set of files and is considered infallible, while the edge cluster stores only some popular files and may suffer from data loss. Due to the high bandwidth and low latency of edge storage, file requests are preferentially responded to by the edge servers. If the file is not stored at the edge, or if the server fails and data is lost, the request is further forwarded to the cloud data center. The deployment of edge storage can be performed during off-peak hours to reduce the traffic pressure on the backbone network [3], [4]. The service provider can specify the update interval between two deployments to achieve a balance between traffic overhead and service performance.

The edge metadata overhead due to deduplication mainly comes from two aspects. The first is called file recipes, which are effectively a list of per-chunk metadata as the chunks appear in each file stored at the edge. If one chunk exists in a file multiple times, its metadata also occurs multiple times in the file recipe. This helps file reconstruction in the data retrieval process. The second records the mappings between chunks and edge servers, which helps to retrieve the specific chunks based on the recorded address.
The edge resources are composed of a set of edge servers S = {s_1, s_2, . . .}. The storage capacity of server s_k is denoted by cap(s_k). Let the Boolean variable y_{j,k} indicate whether chunk c_j is stored at edge server s_k. We do not consider the case that a chunk or file is replicated several times at one server, because it has no effect on the access shunt but only aggravates data redundancy. Let size(c_j) denote the size of chunk c_j; then the storage cost of server s_k is the total size of the chunks stored on it, i.e., size(s_k) = \sum_{c_j \in C} y_{j,k} · size(c_j). Note that this size function generates a constant value for fixed-size chunking [26] and varies for variable-size chunking algorithms [17]. The specific definitions are summarized in Table I. A file request can only be hit when all of its referenced chunks are available at the edge. This depends on two critical preconditions. The first is that all referenced file chunks are stored at the edge. We let α_i indicate whether this precondition is satisfied.
    α_i = \prod_{j: x_{i,j}=1} Bool\Big(\sum_{k=1}^{|S|} y_{j,k}\Big),   (1)

where the Bool function returns "1" when its variable is not zero.
The second precondition is that, among the \sum_{k=1}^{|S|} y_{j,k} servers holding chunk c_j, there must be at least one available server when retrieving file f_i with x_{i,j} = 1. We use R = {r_1, r_2, . . .} to denote the reliability of each edge server. In this paper, we consider that the unreliability of edge servers mainly stems from two aspects. One is the inherent properties of edge servers. For instance, service providers may deploy some inexpensive servers at the edge to reduce costs, and their reliability can typically be inferred from historical data or equipment manufacturers. Another type of unreliability stems from the fact that these storage resources are either idle resources provided by various enterprises and individuals or proprietary resources reserved for other applications that have not been fully utilized yet. A distinctive feature of such resources is that they may be reclaimed at any time by their owners or used for other applications. The reliability of such resources can be ascertained through prior negotiation, since obtaining permission to use these resources requires consent from their owners. Therefore, we assume that the reliability of edge servers is predetermined in this paper. Let P(f_i, x, y, R) indicate the file availability of f_i under the server reliability R. It depends on the data-sharing dependencies between the stored files (Boolean x) and is deeply associated with the locations of its contained chunks and their replicas (Boolean y). There is currently a lack of closed-form quantification, but without loss of generality, P(f_i, x, y, R) can be estimated coarsely as the product of the reliabilities of the involved servers, i.e.,

    P(f_i, x, y, R) = \prod_{s_k \in S(i)} r_k,   (2)

where S(i) denotes the minimum set of servers that can cover all chunks of file f_i. With the aforementioned Boolean variables about files and chunks, we can formulate the deduplication storage problem as follows.
• When a file is stored (α_i = 1), each of its partitioned chunks c_j should have at least one replica at the edge:

    α_i · x_{i,j} \le \sum_{k=1}^{|S|} y_{j,k}, \quad \forall f_i \in F, c_j \in C.   (3)

• When chunk c_j is not referenced by any stored file, it is unnecessary to store c_j at the edge. In addition, for any necessary chunk, there are at most |S| replicas across the |S| edge servers:

    0 \le \sum_{k=1}^{|S|} y_{j,k} \le |S| · Bool\Big(\sum_{f_i \in F} α_i · x_{i,j}\Big), \quad \forall c_j \in C.   (4)

• The total size of the chunks stored on each edge server cannot exceed its storage capacity:

    \sum_{c_j \in C} y_{j,k} · size(c_j) \le cap(s_k), \quad \forall s_k \in S.   (5)

• The state variables are all Boolean:

    α_i, x_{i,j}, y_{j,k} \in \{0, 1\}.   (6)

We develop the optimization objective of our MEAN scheme, i.e., maximizing the file hit ratio, as follows:

    \max \sum_{f_i \in F} h_i · α_i · P(f_i, x, y, R).   (7)

This requires an elegant trade-off between the space efficiency and file availability rationales. Space efficiency corresponds to maximizing the number of stored files, i.e., α_i. File availability corresponds to maximizing the reliability of each stored file, i.e., P(f_i, x, y, R). In addition, placing popular files at the edge can serve more data requests per unit of time, i.e., h_i for file f_i, which further augments the file hit ratio. Then the deduplication storage problem can be formulated with (7) as the objective and (3)∼(6) as the constraints.
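For intuition, the objective in (7) can be evaluated for a candidate placement as sketched below in Python; the greedy construction of the minimal covering server set S(i) and all variable names are our own illustrative assumptions, following the coarse availability model in (2).

    from math import prod

    def file_availability(file_chunks, chunk_locations, reliability):
        """Approximate P(f_i): product of reliabilities over a (greedily built) server set
        covering all chunks of the file; returns 0 if some chunk is not at the edge."""
        needed = set(file_chunks)
        chosen = set()
        while needed:
            best = max(reliability, key=lambda s: len(needed & chunk_locations.get(s, set())))
            covered = needed & chunk_locations.get(best, set())
            if not covered:
                return 0.0            # alpha_i = 0: a referenced chunk is missing at the edge
            chosen.add(best)
            needed -= covered
        return prod(reliability[s] for s in chosen)

    def objective(files, chunk_locations, reliability):
        """Objective (7): sum of h_i * alpha_i * P(f_i, x, y, R) over all files."""
        return sum(f["popularity"] * file_availability(f["chunks"], chunk_locations, reliability)
                   for f in files)   # each f: {"popularity": h_i, "chunks": [...]}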

B. Problem Analysis
In this subsection, we show that the defined problem is NP-hard by proving the hardness of a particular case, i.e., the case where no chunks are shared between files and all servers are reliable.
Theorem 1: The problem of deduplication storage with unreliable resources is NP-hard.
Proof: We prove that the problem is NP-hard by showing that a special version of the proposed storage problem is equivalent to the knapsack problem, which is known to be NP-hard [40]. The special case is when zero redundancy exists between any pair of files and all servers are reliable. Specifically, we consider a set of |F| items, each of size size(f_i) and associated with a reward h_i. The knapsack size is set as M. The knapsack problem aims to find a subset SF ⊆ F of the items that has a total size no greater than M and achieves the maximum reward, i.e., to maximize \sum_{f_i \in SF} h_i under \sum_{f_i \in SF} size(f_i) \le M. This knapsack problem directly corresponds to the simplified problem when no duplicated chunks exist between any pair of files and the storage resources are all reliable, where the server capacities are quantified as \sum_{s_k \in S} cap(s_k) = M. As a result, the knapsack problem can be exactly viewed as a special case of the proposed deduplication storage problem, which implies NP-hardness.
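As a concrete illustration of the special case only (whole files, no shared chunks, fully reliable servers), the corresponding 0/1 knapsack instance can be solved by the standard pseudo-polynomial dynamic program below; integer file sizes are assumed for the sketch.

    def knapsack_special_case(files, capacity):
        """files: list of (size(f_i), h_i) with integer sizes; capacity: M = sum of cap(s_k).
        Returns the maximum total popularity of files that fit, valid only when no chunks
        are shared between files and all servers are reliable."""
        dp = [0] * (capacity + 1)           # dp[m] = best total popularity using space m
        for size_f, h in files:
            for m in range(capacity, size_f - 1, -1):
                dp[m] = max(dp[m], dp[m - size_f] + h)
        return dp[capacity]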
Although the knapsack problem has been extensively studied, the deduplication storage problem in this paper is much more complex than the knapsack problem. The similarity of files can significantly increase the complexity, and the placement of chunks can lead to different file reliabilities, which further complicates the problem. Therefore, we provide efficient heuristic algorithms based on similarity-aware hierarchical clustering in Section IV to enhance the file hit ratio for deduplication-enabled storage in unreliable environments.

IV. THE MEAN METHODOLOGY
In this section, we first present similarity-aware hierarchical clustering (SHC). Based on this, we propose the MEAN methodology according to three different reliability scenarios. The three scenarios are progressive, where the latter is a generalization of the former.

A. Similarity-Aware Hierarchical Clustering
The algorithms of MEAN are derived from the greedy idea that the storage scheme with the highest gain is selected at each step until the storage space is full. We define a ranking index h · Δp/Δc, which indicates the gain of the hit ratio per unit of storage space. The files with higher ranking indexes are more beneficial to store at the edge, where h indicates the file popularity, Δp represents the increment of file availability, and Δc denotes the extra space cost. The popularity of a file can be estimated based on its access frequency within a specified period [39], which is assumed to be predetermined in this paper. Δp is calculated by comparing the file's reliability before and after adding new chunks to the storage system, which is discussed in detail in Section IV-B2. Δc denotes the additional data size the system needs to store. However, two main challenges exist when employing the ranking index directly for searching candidate files. The first is that the Δc value is determined based on the difference with the existing stored data. This makes it hard to select a set of closely related (sharing a great portion of chunks) but big-volume files, because they share few chunks with the already-stored files. The second challenge is that the searching process is time-consuming for the numerous file candidates, especially when calculating the Δc value by comparing the contained chunks between the file candidates and the stored content. Besides, the calculation is repeated because the Δc value must be updated after each decision as the set of stored chunks grows. To handle these two challenges, we first propose a Similarity-aware Hierarchical Clustering (SHC) method, which clusters multiple closely related files in advance. This improves the chance for big-volume files to be selected through deduplication among them. Then, we design an acceleration scheme based on the Bloom Filter to reduce the time complexity of file comparison.
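As a small illustration (with our own variable names), the ranking index for a candidate file against the set of already-stored chunks could be computed as follows, taking Δp = 1 for the reliable case:

    def ranking_index(popularity, file_chunks, stored_chunks, chunk_size, delta_p=1.0):
        """h * Δp / Δc, where Δc counts only the chunks not yet stored at the edge."""
        new_chunks = file_chunks - stored_chunks          # sets of chunk fingerprints
        delta_c = sum(chunk_size[c] for c in new_chunks)
        if delta_c == 0:
            return float("inf")   # the file is already fully covered by stored chunks
        return popularity * delta_p / delta_c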
1) Clustering Based on Ranking Index: We use the example in Fig. 3(a) to illustrate the first challenge, assuming that all chunks have the same size of one unit. For simplicity, we ignore the server reliability here. The edge storage capacity is assumed to be 5 chunks. Based on the ranking index, file f1 will be selected first, with the maximum ranking index value of h/Δc = 3/1 = 3. Since chunk c1 has been chosen, files containing this chunk will have a Δc value smaller than their actual size in the following computation. Therefore, files similar to file f1 have a higher probability of being selected to be stored at the edge. Thus, file f2 is chosen, which only needs to store two new chunks c3 and c4, with a ranking index value of h/Δc = 6/2 = 3. In the same manner, file f3 will be selected in the next round, which requires a new chunk c2 to be stored. As a result, files f1, f2, and f3 would be successively selected to be stored at the edge with a total popularity of 3 + 6 + 3 = 12. However, a superior solution is to store files f4, f5, and f6, with a total popularity of 6 + 8 + 5 = 19. The reason for this difference is that files f4, f5, and f6, which are popular and closely related, cannot be detected. Each of these files consists of multiple chunks and therefore has a large Δc value. Thus, each of them has a small ranking index value, making them ignored by the ranking-based heuristic.
To improve the performance of the heuristic, we propose similarity-aware hierarchical clustering. Hierarchical clustering [41] is an iterative clustering process. In each iteration, it merges the most similar pair of clusters or files (which share a large portion of their chunks) into a new cluster. Considering the popularity of different files, we improve it in a popularity-driven manner, i.e., the most similar cluster pair is merged only if the combined ranking index is greater than each of the previous ones. The updated popularity h is the sum of those of the two original clusters, while c is the size of their union set. The reliability of the updated cluster should be recalculated based on the locations of the chunks in this union set (see Section IV-B). In this way, the number of generated clusters is much smaller than the number of original files, thus significantly decreasing the computation complexity in comparing the ranking indexes. In addition, the updated index value of the cluster is generally larger than before, because the value of c can be greatly reduced after data deduplication. This facilitates the detection of large-volume but closely related files, like f4 ∼ f6 in Fig. 3(a).
The similarity-aware hierarchical clustering relies on a similarity function that indicates which pair of clusters to merge in each iteration. For this purpose, we adopt the widely used Jaccard similarity coefficient [42]. For two clusters A and B, their Jaccard similarity coefficient is defined as J(A, B) = |A ∩ B|/|A ∪ B|. To derive the intersection set and union set of the two clusters, an intuitive method is to compare the chunk fingerprints (e.g., using MD5 [43] or SHA-1 [44] coding). The intersection of two files can be calculated as the sum of the sizes of chunks in both files that have the same fingerprint. The union of two files is the sum of the sizes of their unique chunks. In each round, we choose the two files with the largest Jaccard similarity coefficient to merge. If the merged ranking index is larger than each of the previous ones, the two files are merged to generate a new cluster. Otherwise, the Jaccard index between them is set to 0, and the two files are not considered for merging in the following rounds. Taking Fig. 3(a) as an example, the Jaccard similarity coefficient of files f2 and f3 is J(f2, f3) = 2/4. After merging them, the ranking index value is (6 + 3)/4, which is greater than each of their original values, i.e., 2 and 1, respectively. Therefore, files f2 and f3 can be merged to generate a cluster ϕ1. Thereafter, ϕ2 and ϕ3 are sequentially constructed, resulting in a final set of three clusters: {f1, ϕ1, ϕ3}, as shown in Fig. 3(b). Therefore, we can choose the cluster ϕ3 with the maximum ranking index to store at the edge. The total popularity is 6 + 8 + 5 = 19, which is greater than that of the basic heuristic, i.e., 12 as aforementioned. It is worth noting that the maximum number of clustering hierarchies can be preset so that the cluster sizes are within a reasonable range. For the example in Fig. 3, we can stipulate that the files are merged at most once. Thus, SHC outputs only two clusters, i.e., ϕ1 and ϕ2. When the total storage space is 4 chunks instead of 5 chunks, this approach can still achieve elegant performance.
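A compact sketch of this popularity-driven merging rule is given below (exact fingerprint comparison; the BF acceleration of the next subsection is omitted, and the stop-on-failure rule is a simplification of setting the pair's Jaccard index to 0):

    def jaccard(a, b, chunk_size):
        """Size-weighted Jaccard similarity of two chunk-fingerprint sets."""
        inter = sum(chunk_size[c] for c in a & b)
        union = sum(chunk_size[c] for c in a | b)
        return inter / union if union else 0.0

    def shc(clusters, chunk_size, max_merges=None):
        """clusters: list of {"h": popularity, "chunks": set of fingerprints}.
        Merge the most similar pair while the merged ranking index beats both originals."""
        def index(c):
            return c["h"] / sum(chunk_size[x] for x in c["chunks"])   # h / c (reliability omitted)
        merges = 0
        while len(clusters) > 1 and (max_merges is None or merges < max_merges):
            best_sim, pair = 0.0, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    sim = jaccard(clusters[i]["chunks"], clusters[j]["chunks"], chunk_size)
                    if sim > best_sim:
                        best_sim, pair = sim, (i, j)
            if pair is None:
                break
            i, j = pair
            merged = {"h": clusters[i]["h"] + clusters[j]["h"],
                      "chunks": clusters[i]["chunks"] | clusters[j]["chunks"]}
            if index(merged) <= max(index(clusters[i]), index(clusters[j])):
                break     # simplification: stop once the best pair no longer improves the index
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
            merges += 1
        return clusters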
2) BF-Based Sketch and Acceleration: Although hierarchical clustering can reduce the space of comparison, the calculation of the Jaccard index may still bring significant time overhead. For example, for two clusters ϕ1 and ϕ2 with |ϕ1| and |ϕ2| chunks, it takes O(|ϕ1| × |ϕ2|) time to determine the number of shared chunks. To further decrease the computation complexity, we adopt the Bloom Filter (BF) [45], [46] to sketch the fingerprints of the chunks in each cluster. The basic BF is a hash mapping method that has been widely utilized in various networking and distributed systems. We transfer a similar idea to the process of file comparison and propose the BF-based sketch. This turns the computation of the Jaccard similarity coefficient from pair-wise fingerprint checking into membership queries on the cluster sketches.
When calculating the intersection set between a cluster ϕ_i and another cluster ϕ_i', we first require the BF vector of ϕ_i'. For any chunk c_j in ϕ_i, the BF judges that this chunk does not belong to ϕ_i' if any bit at the k_BF hashed positions in that BF vector is 0, where k_BF denotes the number of hash functions used by the BF. Otherwise, the BF believes that the queried chunk c_j belongs to ϕ_i', with a certain rate of false positives. The chunks of ϕ_i judged to belong to ϕ_i' then constitute exactly the intersection set between ϕ_i and ϕ_i'. Fig. 4 provides an illustrative example of the BF-based sketch. Given a cluster with the chunk set ϕ = {c_1, c_2, c_3}, the BF represents ϕ with a bit vector of length l_BF = 19. All l_BF bits in the vector are initially set to 0. The k_BF = 2 independent hash functions are employed to map each chunk into k_BF positions in the bit vector. Those hit positions are all set to 1. The binary string derived from the hash functions is exactly the BF-based sketch. These sketches are updated with XOR operations when clusters are merged. In this way, each cluster maintains a bit vector to record the membership information at the chunk level. According to the mapped bit vector and the utilized hash functions, we can realize lightweight membership queries against any cluster, which accelerates the computation of the Jaccard similarity coefficient. To achieve this, we maintain two variables, d_∩ and d_∪. The former is initially 0, while the latter is the sum of the sizes of the two clusters. When calculating the Jaccard similarity coefficient of a cluster ϕ_i' with the current cluster ϕ_i, we only need to hash each chunk of cluster ϕ_i to the BF-based sketch of ϕ_i' in turn. If the query finds that the chunk has been recorded, the value of d_∩ is updated to the original value plus the size of the chunk, and the value of d_∪ is updated to the original value minus the size of the chunk. In this way, the Jaccard similarity coefficient of the two clusters can be represented as d_∩/d_∪. The time complexity is thereby decreased to O(|ϕ| · k_BF), where k_BF indicates the number of utilized hash functions.
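A minimal Python sketch of a cluster's BF vector and the membership-based estimate of d_∩ and d_∪ is shown below; the bit-vector length, the number of hash functions, and the SHA-256-based position derivation are illustrative choices, not the exact construction used in the prototype.

    import hashlib

    L_BF, K_BF = 1 << 16, 2   # illustrative bit-vector length and number of hash functions

    def bf_positions(fingerprint):
        """Derive k_BF bit positions from a chunk fingerprint."""
        return [int(hashlib.sha256(f"{i}:{fingerprint}".encode()).hexdigest(), 16) % L_BF
                for i in range(K_BF)]

    def bf_build(chunks):
        """Build the BF-based sketch of a cluster (its set of chunk fingerprints)."""
        bits = bytearray(L_BF)
        for c in chunks:
            for p in bf_positions(c):
                bits[p] = 1
        return bits

    def bf_contains(bits, c):
        return all(bits[p] for p in bf_positions(c))

    def jaccard_estimate(cluster_a, bf_b, size_b, chunk_size):
        """Estimate J(A, B) by querying A's chunks against B's sketch; false positives possible."""
        d_inter = 0
        d_union = sum(chunk_size[c] for c in cluster_a) + size_b
        for c in cluster_a:
            if bf_contains(bf_b, c):
                d_inter += chunk_size[c]
                d_union -= chunk_size[c]
        return d_inter / d_union if d_union else 0.0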
The penalty of such an approach is the false positive, i.e., for any chunk c ∈ C, all of its k_BF hash positions in the bit vector may have been set to 1 when representing other chunks in set C. This is caused by unavoidable hash conflicts, as with the 8th bit in Fig. 4. The false-positive probability, denoted as p, can be derived by p = (1 − (1 − 1/l_BF)^{n·k_BF})^{k_BF} [45], where n represents the number of represented chunks in set C.

B. SHC-Based Heuristics for Different Scenarios
SHC elaborates on a feasible and effective method to accelerate cluster generation and index calculation. Based on this, we propose effective heuristic algorithms to improve the file hit ratio in deduplication-enabled storage at the unreliable edge. We consider three heterogeneous scenarios to develop the algorithms of MEAN, where the former scenario is a special case of the latter: 1) all servers are reliable (Section IV-B1); 2) all servers have the same reliability (Section IV-B2); 3) servers have heterogeneous reliabilities (Section IV-C1).
1) Scenario One: All Servers are Reliable: When all edge servers are reliable, i.e., r_1 = r_2 = · · · = 1, it is unnecessary to maintain chunk replicas, because each chunk is available without the risk of server crashes. Therefore, the ranking index can be directly simplified as h · Δp/Δc = h/Δc. Besides, the location of chunks no longer affects the file availability, because users can retrieve these chunks no matter which edge server they are placed on. Based on this insight, we only need to consider how to store more popular files with limited storage space, regardless of the mapping between each chunk and the edge server. The specific algorithm is detailed in Algorithm 1.

Algorithm 1: Heuristic for Scenario One.
The algorithm's inputs include the generated set of clusters Φ from SHC and the set of edge servers S. For simplicity, a file that is not clustered is also treated as a cluster. The objective is to select a subset of clusters to be stored at the edge, following the objective shown in (7). We first select an initial cluster ϕ_init with the maximum ranking index (Line 1). Thereafter, we calculate and update the ranking index h/Δc for all candidate clusters ϕ_i ∈ Φ, where Δc_i = size(ϕ_i − Ω ∩ ϕ_i) is derived from the intersection operation between the set Ω and the current cluster ϕ_i. Based on this, we select the cluster with the maximum ranking index consecutively, until size(Ω) reaches the total storage capacity \sum_{s_k \in S} cap(s_k) (Lines 4-10). The set Ω and the candidate clusters Φ are also updated in each round of cluster selection (Lines 9-10). At last, the set of selected clusters Ω is distributed to the edge servers randomly under the constraint of their storage capacities (Line 11).
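A condensed sketch of this greedy selection (Scenario One) is given below; the helper names and the simplification of returning only the selected clusters and chunk set, leaving the random chunk-to-server assignment aside, are our own.

    def scenario_one(clusters, total_capacity, chunk_size):
        """Greedily pick clusters by h/Δc until the total edge capacity is exhausted."""
        stored = set()                      # Ω: fingerprints of chunks already selected
        selected, used = [], 0
        remaining = list(clusters)          # each cluster: {"h": ..., "chunks": set(...)}
        while remaining and used < total_capacity:
            def delta_c(c):
                return sum(chunk_size[x] for x in c["chunks"] - stored)
            best = max(remaining,
                       key=lambda c: (c["h"] / delta_c(c)) if delta_c(c) else float("inf"))
            cost = delta_c(best)
            if used + cost > total_capacity:
                remaining.remove(best)      # skip clusters that no longer fit
                continue
            selected.append(best)
            stored |= best["chunks"]
            used += cost
            remaining.remove(best)
        return selected, stored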
2) Scenario Two: Homogeneous Reliability: In contrast to Scenario One, the impact of server reliability is further taken into account. As edge servers are no longer assumed to be reliable, files stored on them may be susceptible to loss. Therefore, enhancing file availability by incorporating chunk replicas is necessary.
Specifically, when all servers have the same reliability, i.e., r 1 = r 2 = · · · = r, a critical measure to enhance the file availability is to hold chunks of a file on as few servers as possible. The reason is that, as the number of servers in S(i) (the minimum set of servers that can cover all chunks of file f i ) reduces, the file availability can be strengthened, as shown in (2). By contrast, when chunks of a file are distributed across many servers, the file can be acquired only if all of these servers are available. Based on this, an effective approach to improving the availability of a file is to place all its chunks on the same server. However, a significant drawback of this approach is that it can result in a significant number of redundant chunks being repeatedly stored across multiple servers, which is not conducive to the edge cluster storing more files.
We design a dynamic trade-off solution to tackle such a conflict. The basic idea is that, for a file with high popularity, we tend to store all of its chunks on the same server to improve reliability. If some files are extremely popular, we can even keep their replicas on multiple servers to further enhance file availability. By contrast, less popular files can be stored using data deduplication, where only the unique chunks that make up these files are added to the edge cluster (and may not reside on the same server), thereby enabling the edge cluster to accommodate more files. We further propose three judging metrics to determine the specific storage solution for each file according to three different cases. In addition, the mapping relationship between chunks and servers should be taken into account in this scenario, i.e., which chunks are stored by each server. To this end, we number these servers beforehand and then determine which chunks each server should store. Since the reliability of the servers is assumed to be consistent, we ignore the effect of the numbering sequence on the results. The first server does not know what chunks the other servers will store; therefore, we still use the method presented in Algorithm 1 to determine which chunks it should store. Starting with the second server, we show three storage cases of how to select and store a file/cluster dynamically as follows.
• Store the remainder of a new cluster ϕ_i. This storage scheme suggests that we store a new cluster at the edge with deduplication. The space overhead can be denoted as Δc_i = size(Ω_k) = size(ϕ_i − ϕ_i ∩ Ω), where Ω and Ω_k represent the involved chunks that have already been stored at previous servers and the remainder that will be stored at the current server s_k, respectively. In addition, we can calculate the minimum number of servers that have stored the cluster's involved chunks, which is denoted as θ (θ = 0 when there is no involved chunk on other servers). Thus, the cluster can be retrieved with a reliability of r^{θ+1}, since retrieving this cluster requires all involved servers (including the current server) to be available.
• Store the chunks of cluster ϕ_i that are already stored on previous servers to the current server. This storage scheme suggests that we should hold the entire data of an already-stored cluster ϕ_i at the current server, rather than spreading its chunks across multiple servers with deduplication. This enables cluster ϕ_i to be retrieved from a single server. Notably, this scheme is the reverse process of deduplication in the first case, rather than storing a new cluster. It improves the availability of an already-stored cluster ϕ_i from r^{θ+1} to r, and the space overhead, i.e., the data that is stored repeatedly, can be represented by Δc_i = size(ϕ_i ∩ Ω).
• Add a full replica of an already-stored cluster ϕ_i. This storage scheme suggests that we should add a full replica of cluster ϕ_i at the current server, instead of storing a new cluster at the edge. For particularly popular files, this method can effectively improve their availability. The space overhead of such a scheme can be calculated as Δc_i = size(ϕ_i). The availability of cluster ϕ_i can be increased from 1 − (1 − r)^λ to 1 − (1 − r)^{λ+1}, where λ represents the number of replicas of cluster ϕ_i at the preceding servers.

Fig. 5. Currently, we need to determine which chunks are stored on the server ES3. The dashed chunks represent the chunks that will be stored next, and the solid chunks represent the ones that have already been stored.

Algorithm 2: Heuristic for Scenario Two.

The three storage cases are illustrated in Fig. 5, where a new file/cluster ϕ is required to be stored at the edge. Currently, we should determine how to store the file/cluster ϕ on the edge server ES3. In the first case, we consider storing a new cluster ϕ at the edge by only adding the un-stored unique chunks into the edge cluster. Since the involved chunks c_1 and c_2 are already stored, we only need to store chunk c_3 on the current server to achieve space efficiency. The limitation of this scheme lies in the fact that retrieval of the file/cluster is contingent upon the availability of all participating servers. We assume that each server has a reliability of r. Thus, the increased availability Δp of cluster/file ϕ can be calculated as r^{θ+1} − 0 = r^3, and the storage cost Δc is the size of chunk c_3. In the second case, we consider storing the file/cluster ϕ completely on the current server, i.e., replicating chunks c_1 and c_2 on the current server ES3. In this way, although the redundancy is increased, the reliability of file/cluster ϕ is improved. The increased availability Δp of file/cluster ϕ can be approximately computed as r − r^3. It is worth noting that if the file/cluster is stored in this way, we remove it from the set Φ_P and record it in the set Φ_F, which indicates that this file/cluster is fully stored on a single server. In the third case, we consider adding a full replica of an already-stored file/cluster in the set Φ_F to the current server ES3. If a file/cluster is stored using this scheme, it indicates that the file/cluster may have a high popularity and therefore is not suitable for distributed storage. In order to ensure high availability, all chunks of this file/cluster replica should be held on a single server. In this way, as long as any involved server is available, the file/cluster ϕ can be obtained from the edge, which further enhances the availability of popular files. The increased availability Δp can be calculated as [1 − (1 − r)^3] − [1 − (1 − r)^2], where two replicas of ϕ have already been stored on the servers ES1 and ES2.
In particular, considering more cases could potentially yield better solutions, but would bring significant iteration time. Therefore, we mainly consider the above three storage schemes for each file, where the gains in hit ratio are most significant. After evaluating these three storage cases for each file/cluster, we can select the optimal file/cluster and the corresponding storage scheme to maximize the benefit and implement it on the edge server. This iterative process sequentially selects one storage scheme at a time until all servers have reached their maximum storage capacity.
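As a sketch (homogeneous reliability r, with θ, λ and the chunk bookkeeping following the notation above), the Δp and Δc of the three candidate schemes for one cluster at the current server could be scored as follows; capacity checks are omitted for brevity.

    def case_gains(cluster, stored_elsewhere, r, theta, lam, chunk_size):
        """Return {case: (Δp, Δc)} for the three storage cases of Scenario Two.
        theta: number of previous servers holding parts of the cluster (0 if none);
        lam:   number of full replicas already kept on previous servers (>= 1 for case 3)."""
        size = lambda chunks: sum(chunk_size[c] for c in chunks)
        remainder = cluster["chunks"] - stored_elsewhere
        overlap   = cluster["chunks"] & stored_elsewhere

        dedup  = (r ** (theta + 1) - 0.0, size(remainder))            # case 1: store the remainder
        gather = (r - r ** (theta + 1), size(overlap))                # case 2: gather onto one server
        replica = ((1 - (1 - r) ** (lam + 1)) - (1 - (1 - r) ** lam), # case 3: add a full replica
                   size(cluster["chunks"]))
        return {"dedup": dedup, "gather": gather, "replica": replica}

The scheme and cluster with the largest h · Δp/Δc would then be executed at the current server, which corresponds to the selection step of Algorithm 2.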
The specific procedure is elaborated in Algorithm 2. The input bears resemblance to that of Algorithm 1, while the output determines the stored content at each edge server. We select files to store at the first server according to Algorithm 1 directly (Line 2). Subsequently, we compute the ranking index for each remaining server across all potential clusters in three cases. The cluster with the highest ranking index signifies that its corresponding storage method yields greater benefits by improving the hit ratio while minimizing storage costs. Therefore, we select the cluster with the highest ranking index consecutively, until the size of the stored data reaches the storage capacity of each server (Lines 4-16).
1) Scenario Three: Heterogeneous Reliability: When the servers have heterogeneous reliabilities, i.e., r_1, r_2, . . . , r_{|S|} are not necessarily equal, placing files on different servers can result in distinct file availability. The files with higher ranking indexes should be stored on more reliable servers, since such files tend to yield higher storage gains with less space overhead. Otherwise, the availability of these popular files can only be ensured with replicas across multiple unreliable servers, which additionally occupies a large volume of precious storage resources. Based on this insight, we sort these servers in descending order according to their reliability. Starting with the most reliable server, we progressively select files based on improvements to the scheme in Scenario Two. The three different cases in this scenario are presented as follows.
• Store the remainder of a new cluster ϕ_i. The space overhead can be denoted as Δc_i = size(Ω_k) = size(ϕ_i − ϕ_i ∩ Ω). The file availability can be derived as r_k · \prod_{s_j \in Θ_i} r_j, where Θ_i denotes the minimum set of previous servers that have stored its involved chunks (Θ_i = ∅ when there is no involved chunk on other servers).
• Store the chunks of cluster ϕ_i that are already stored on previous servers to the current server. It improves the availability of an already-stored cluster ϕ_i from r_k · \prod_{s_j \in Θ_i} r_j to r_k, and the space overhead, i.e., the data that is stored repeatedly, can be represented by Δc_i = size(ϕ_i ∩ Ω).
• Add a full replica of an already-stored cluster ϕ_i. The space overhead can be calculated as Δc_i = size(ϕ_i), and the availability of cluster ϕ_i is increased from 1 − \prod_{s_j \in Λ_i}(1 − r_j) to 1 − (1 − r_k) · \prod_{s_j \in Λ_i}(1 − r_j), where Λ_i represents the set of previous servers that store the replicas of ϕ_i.

The algorithm in this scenario is quite similar to Algorithm 2. The difference is that the servers in the set S should be pre-ordered based on their respective reliabilities, and Δp_i in Lines 7, 9, and 11 should be replaced with the gains derived from the above three formulas. Therefore, the algorithm is omitted here. Furthermore, it is worth noting that MEAN can actively create replicas for highly popular files. This increases their availability, and the replicas also help avoid server hot spots. In this way, retrieval requests for popular files can be effectively balanced across multiple servers, thus avoiding overload on a single server.
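Under heterogeneous reliabilities, the same three cases can be scored with products of the involved servers' reliabilities, as in the following sketch consistent with the formulas above; Θ_i and Λ_i are passed as sets of server identifiers, and all names are our own.

    from math import prod

    def case_gains_hetero(cluster, stored_elsewhere, r_k, theta_servers, lambda_servers,
                          reliability, chunk_size):
        """Δp and Δc of the three cases at the current server with reliability r_k.
        theta_servers:  Θ_i, previous servers already holding parts of the cluster;
        lambda_servers: Λ_i, previous servers already holding full replicas of the cluster."""
        size = lambda chunks: sum(chunk_size[c] for c in chunks)
        remainder = cluster["chunks"] - stored_elsewhere
        overlap   = cluster["chunks"] & stored_elsewhere

        spread = r_k * prod(reliability[s] for s in theta_servers)      # availability when spread out
        dedup  = (spread - 0.0, size(remainder))                        # case 1: store the remainder
        gather = (r_k - spread, size(overlap))                          # case 2: gather onto one server
        miss_before = prod(1 - reliability[s] for s in lambda_servers)  # case 3: add a full replica
        replica = ((1 - (1 - r_k) * miss_before) - (1 - miss_before),
                   size(cluster["chunks"]))
        return {"dedup": dedup, "gather": gather, "replica": replica}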
C. Time and Space Complexity

We analyze the time and space complexities of the above algorithms in this subsection, with the results shown in Table II. The similarity comparison between two files/clusters adopts the BF-based sketch. The time complexity of the similarity-aware hierarchical clustering is O(|F|^2 · |C_F|_max · k_BF), where |F| denotes the number of files, k_BF indicates the number of utilized hash functions for the BF-based sketch, and |C_F|_max represents the maximum number of chunks of any file. The space complexity is O(|F| · l_BF), where l_BF expresses the BF length.
The time complexity of the heuristic algorithm in Scenario One is O(|Φ|^2 · |C_Φ|_max · k_BF), where |Φ| denotes the number of generated clusters and |C_Φ|_max indicates the maximum number of chunks in any cluster. The space complexity is O(|Φ| · m). For Scenario Two, the time complexities are multiplied by |S|, while |S| · l_BF additional space is occupied for recording the integrated sketches of the data stored at the |S| servers. As for Scenario Three, the additional space complexity of O(|Φ| · |S|) is caused by maintaining a list of possible storage servers for each cluster.

V. PERFORMANCE EVALUATION
In this section, we implement a prototype of MEAN and evaluate the performance using a real-world dataset. We describe our experimental settings and then present the numerical results of different methods.

A. Experimental Settings
We implement a prototype system of MEAN to evaluate the performance in real-world environments. The prototype includes a cloud and an edge cluster to simulate the file retrieval behavior of edge storage. This constitutes a two-tier storage architecture, where requests for files are first responded to by the edge storage cluster, and the missed requests are further forwarded to the cloud. The edge cluster stores a subset of popular files, while the cloud keeps a complete backup of all files. In our prototype system, the cloud is deployed on the Elastic Compute Service (ECS) of Alibaba Cloud [47], which is equipped with a 2.5 GHz 8-vCPU processor, 16 GB RAM, and a 40 GB SSD. The ECS runs Ubuntu Linux 16.04 x64. The edge storage consists of 11 VMs deployed on a desktop PC, equipped with a 3.50 GHz Intel(R) Core(TM) i9-11900K CPU with 8 cores, 64 GB RAM, and a 500 GB SSD. Each VM is allocated 4 GB of RAM and a 30 GB virtual disk drive, running Ubuntu Linux 20.04 x64. The CPU cores are shared by all VMs. We use the iPerf and ping tools to measure the network performance. The bandwidth between the ECS and the local VMs is 91.6 Mbps and the latency is 29.05 ms, while the bandwidth between any two local VMs is 1.24 Gbps and the latency is 1.43 ms, based on the average of 10 measurements.
In our experiments, 10 VMs act as the edge storage servers, and the remaining one acts as a data requester to retrieve files from these edge servers or the cloud. This VM also acts as a management node, which is responsible for monitoring the status of other edge servers through heartbeat packets among them. The management node maintains the mapping information between each file and its chunks, as well as the mapping between each chunk and servers in the edge cluster. When the requested file cannot be served by the edge cluster due to failures, the request is further forwarded to the cloud server by the management node.
Datasets: We evaluate the performance of MEAN using two real-world datasets from the GitHub website [35], which are extracted from 357 popular repositories. The first dataset comprises code images in the .zip format, referred to as SRC. The second dataset encompasses released installers in formats such as .rpm, .deb, .apk, etc., denoted as RLS. These repositories are selected randomly under some hot topics, e.g., Azure, Amazon Web Services, Docker, etc. We download multiple popular versions from each repository. The SRC dataset comprises a total of 3,099 files with file sizes ranging from 2.74 KB to 12.6 MB. Meanwhile, the RLS dataset contains 1,617 files with file sizes varying between 3.26 KB and 24.1 MB. We chunk these files using the variable-size chunking method [17], which declares chunk boundaries based on the byte contents and is widely proven to be more efficient than the fixed-size chunking method [28], [29]. The SRC and RLS datasets exhibit an average chunk size of 4.07 KB and 3.84 KB, respectively, with deduplication ratios (the size of duplicated data divided by the total size, where a larger value indicates that deduplication can eliminate more redundancy) of 53.01% and 26.03%. The popularity of each file is generated with the widely utilized Zipf distribution [18], [48].
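For reference, the deduplication ratio as defined here and a Zipf-distributed popularity assignment can be computed along the following lines; the skew parameter and the random rank assignment are illustrative, not the exact settings of our experiments.

    import random

    def dedup_ratio(all_chunk_sizes, unique_chunk_sizes):
        """Size of duplicated data divided by the total size."""
        total = sum(all_chunk_sizes)
        unique = sum(unique_chunk_sizes)
        return (total - unique) / total

    def zipf_popularity(num_files, alpha=1.0):
        """Assign Zipf-like popularity: the file ranked k gets weight 1 / k^alpha."""
        ranks = list(range(1, num_files + 1))
        random.shuffle(ranks)               # random rank-to-file assignment
        return [1.0 / (rank ** alpha) for rank in ranks]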
Comparison Methods: We implement the following comparison methods to determine which files the edge cluster stores.
• HotDedup, which is an implementation of the HotDedup algorithm [2]. The popularity of the stored files is maximized under the capacity constraints of the edge servers. These files are deduplicated in a global sense, and the unique chunks are distributed across the edge servers evenly.
• PopF, which selects the most popular files to store at the edge. Such a Popularity-First strategy is widely adopted by many edge storage systems [4], [5], [6]. We improve this strategy by emphasizing file availability and space efficiency: all chunks of a selected file are stored on one server, and deduplication is used to eliminate duplicate chunks on each server.
• PopF_3R, which augments the PopF method with replication to enhance file availability. The number of replicas is set to 3, which is the default value in many production distributed storage systems [22], [23], [24].
• Cloud_only, which retrieves all requested files from the cloud, regardless of the edge storage.

Settings and Metrics: The results are based on the average of 10 rounds of experiments. The default reliability of the 10 edge servers is set as [0.8; 0.5; 0.7; 0.7; 0.8; 0.6; 0.5; 0.9; 0.4; 0.6], and their total storage capacity defaults to 20% of the dataset size. In each round, we evaluate the performance of different comparison methods by randomly generating 500 file retrieval requests based on the file popularity. Then, we randomly shut down part of the edge servers according to their reliability and retrieve files in the retrieval list according to the Poisson distribution. The arrival rate λ is set as 90 by default, which indicates the expected number of retrieval requests in one minute. Our metrics include the file hit ratio, the average retrieval delay, and the average retrieval throughput. Besides, we also consider the balance of the service load across the involved edge servers. In each round, we record the total amount of data sent by each server as its service load. Then, we calculate the balance metric, which is defined as the deviation between the maximum and the average service load.

1) Performance at Varying Levels of Reliability:
We first evaluate the sensitivity of different methods to server reliability. To facilitate comparison, we maintain uniform reliability across all edge servers and vary this parameter from 0.5 to 1.0, which corresponds to the scenario of homogeneous reliability. When the reliability is set to 1.0, MEAN adopts the algorithm of Scenario One in Section IV-B1; other reliability settings correspond to the algorithm of Scenario Two, as described in Section IV-B2. The total storage space is set to the default value, i.e., 20% of the dataset size. The results are presented in Figs. 6 and 7, with the former reporting the average file hit ratio on both datasets and the latter showing the average retrieval delays and throughput on the SRC dataset.
Higher server reliability improves both the file hit ratio and the retrieval performance, while low reliability degrades all methods to varying degrees. HotDedup is the most sensitive to server reliability: its hit ratio remains low whenever server reliability is below 0.9, as shown in Fig. 6. As a result, its average retrieval delay is only slightly lower than that of Cloud_only, exceeding 0.7 seconds per file (Fig. 7(a)), and its throughput stays below 73 Mbps (Fig. 7(b)). The optimal hit ratio is attained only when all servers are reliable. The main reason is that HotDedup assumes reliable edge servers and evenly distributes the chunks of each file across them. Such an approach is unfriendly to file availability: the failure of any single server can make a large number of associated files unavailable, which significantly hurts the file hit ratio.
The PopF and PopF_3R methods exhibit higher hit ratios than HotDedup in unreliable environments, which translates into better retrieval delay and throughput when server reliability falls below 0.9, as shown in Fig. 7. In particular, PopF_3R achieves only a slightly higher hit ratio than PopF when server reliability is 0.5. This is because the three-way replication policy further bolsters file availability in an unreliable environment, allowing each popular file to withstand up to two server failures. However, such fault tolerance also has a downside due to the extra space it occupies. As server reliability grows, the PopF method surpasses PopF_3R. When server reliability exceeds 0.9, the gap between the two methods reaches over 20% in hit ratio and more than 50 Mbps in throughput, as illustrated in Figs. 6(a) and 7(b). The reason is that the benefit of replication shrinks as servers become more reliable: maintaining three replicas consumes considerable space, which limits the number of files stored at the edge and forces a high volume of requests to be served by the cloud. In contrast, MEAN demonstrates superior file retrieval performance across a wide range of reliability settings, owing to its ability to perform data deduplication efficiently while adjusting the number of chunk replicas for different reliability scenarios. However, since the RLS dataset has a lower deduplication ratio than the SRC dataset (i.e., fewer duplicated chunks), the hit-ratio gap between MEAN and PopF is smaller on the RLS dataset than on the SRC dataset, as Fig. 6 shows.
2) Performance Under Different Storage Capacities: We vary the storage capacity to evaluate its impact on file retrieval performance. The total storage capacity of the edge servers is increased from 5% to 30% of the dataset size, with server reliability kept at the default values. MEAN adopts the algorithm of Scenario Three, as described in Section IV-C1. The results are presented in Figs. 8 and 9. Fig. 8 depicts the impact of storage capacity on the average file hit ratio. MEAN consistently achieves the highest hit ratio, since it considers both space efficiency and file availability, followed by PopF and PopF_3R. The PopF method has a higher hit ratio than PopF_3R when the edge storage capacity is low; nevertheless, when the capacity exceeds 25%, PopF_3R reverses this on the SRC dataset. This is because, when the storage space is large enough, replication increases file availability and copes with more server failures. However, neither method can effectively exploit file similarity to improve space efficiency, so their hit ratios remain lower than MEAN's under the same storage capacity. In contrast, HotDedup exhibits a relatively low hit ratio, only about 10% on the SRC dataset and close to zero on the RLS dataset, because the failure of any server involved in a file easily results in a missed hit. When the storage capacity reaches 30%, MEAN and HotDedup differ by up to 77% in average file hit ratio across both datasets. Fig. 9 reports the average retrieval delay and throughput on the SRC dataset. When all files are retrieved from the cloud, the average delay is around 0.9 seconds with a throughput of about 19 Mbps. Storing files at the edge significantly reduces the average retrieval delay: with a storage capacity of only 5% of the dataset size, MEAN decreases retrieval delays by over 50%, to around 0.43 seconds, while the average retrieval throughput reaches up to 135 Mbps. As the storage capacity increases, this gap gradually widens. Specifically, when the storage capacity accounts for 30% of the dataset size, MEAN exhibits an average retrieval delay of approximately 0.15 seconds, a reduction of 83% compared to cloud-based file retrieval and of 71% compared to HotDedup. Furthermore, this delay is roughly half of that observed with the PopF and PopF_3R methods.
3) Load-Balancing Performance: To facilitate comparison, we use the SRC dataset for this and the subsequent experiments. The load-balancing performance of the compared methods on the SRC dataset is depicted in Fig. 10. Specifically, Fig. 10(a) illustrates the load-balancing performance under different reliability settings. As server reliability increases, MEAN prioritizes storing new files over adding replicas of stored files; consequently, the balance score of MEAN increases gradually and slightly. However, when server reliability is set to 1, MEAN reverts to the algorithm of Scenario One, where chunks are evenly and randomly distributed across all edge servers. This results in uniform traffic distribution among the servers and promotes better load-balancing performance. In unreliable settings, HotDedup fails to achieve good load balancing because only a small number of servers handle users' requests while the majority are idle, so its measured score is relatively large. Fig. 10(b) presents the load-balancing performance under varying storage capacities. The PopF and PopF_3R methods exhibit superior load-balancing performance due to their even distribution of stored files across different servers. In contrast, MEAN's score is slightly higher, as it tends to store popular files on more reliable servers while placing less popular files on less reliable ones. Nonetheless, MEAN's score remains relatively stable and reasonable thanks to its ability to distribute retrieval traffic across the chunk replicas of stored files.

4) Performance Under Different Workloads:
We further evaluate the impact of different workloads on file retrieval performance by setting different arrival rates λ. A large arrival rate reflects the large number of retrieval requests during peak hours. We increase the arrival rate λ from 60 to 210 and measure the average retrieval delay and throughput of different methods for retrieving 100 files. Table III lists the average retrieval delay under different arrival rates, and the corresponding throughput is presented in Fig. 11(a). As the arrival rate increases, the retrieval delay of all methods exhibits an upward trend. However, the number of requests that the edge storage can serve varies with the file allocation method. As a benchmark, the Cloud_only method forces all retrieval requests to be served by the cloud server, leading to severe congestion on the backbone network; therefore, its delay grows faster than the others. Besides, since server failures are not considered, HotDedup can only reduce the latency to a limited extent. When the arrival rate is set to 210, the average retrieval delay of Cloud_only reaches 9.351 seconds, followed by HotDedup at 7.622 seconds; these delays are almost 30 times higher than MEAN's. The high volume of requests competing for the limited bandwidth of the backbone network also leads to poor average throughput for both methods, which stays below 50 Mbps. In contrast, the other three methods improve file availability through their fault-tolerant mechanisms: they experience only a slight increase in retrieval delays, and MEAN further reduces the delay by more than 40% compared to PopF and PopF_3R. Fig. 11(b) shows the CDF of retrieval delays in one round of experiments with the arrival rate fixed at 120. MEAN completes up to 87.5% of requests within 0.5 seconds, compared to only around 65% for the PopF and PopF_3R methods and less than 20% for Cloud_only. In addition, Cloud_only and HotDedup both suffer from long-tail distributions: their maximum delay is up to 3.5 seconds, compared to around 2.23 seconds for the other methods.

5) Failure of the Management Node: In the above experiments, we assume that the management node is deployed on a highly reliable server (such as a proprietary metadata server provided by the service provider) and disregard any potential impact of its failure. Nevertheless, the management node can be a single point of failure in a practical deployment, as it retains all metadata related to the files stored at the edge. If the management node fails, the whole system may fail and all retrieval requests will have to be handled by the cloud.
To mitigate the impact of a single-point failure, there are two viable approaches. The first is to deploy multiple management nodes at the edge. We vary the reliability of the management nodes from 0.8 to 1.0 and conduct 100 rounds of experiments to evaluate its impact. In each round, we randomly mark some nodes (including the management nodes and the storage nodes) as failed based on their reliabilities and generate 1,000 file requests according to the file popularity to simulate data retrievals. The results are presented in Fig. 12(a), where MEAN-1M, MEAN-2M, and MEAN-3M correspond to the MEAN method with 1, 2, and 3 management nodes, respectively. The storage capacity of the edge cluster is set to 20% of the dataset size, and the reliability of each storage server is set to 0.8.
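Under this setup, a request is counted as an edge hit only if at least one management node survives and every chunk of the requested file has at least one alive replica; a minimal sketch of that hit test (with hypothetical data structures) is shown below.

def is_edge_hit(file_chunks, chunk_servers, alive_storage, alive_managers):
    """Edge hit iff some management node is alive and every chunk of the
    file has at least one alive replica server."""
    if not file_chunks or not alive_managers:
        return False
    return all(chunk_servers[c] & alive_storage for c in file_chunks)

def round_hit_ratio(requests, file_to_chunks, chunk_servers,
                    alive_storage, alive_managers):
    """Fraction of the generated requests that can be served at the edge."""
    hits = sum(is_edge_hit(file_to_chunks.get(f, []), chunk_servers,
                           alive_storage, alive_managers)
               for _, f in requests)
    return hits / len(requests) if requests else 0.0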
When there is only one management node (MEAN-1M), variations in its reliability have a significant impact on the average file hit ratio: a decrease in reliability from 1.0 to 0.8 reduces the average hit ratio by over 10%. Deploying more management nodes significantly enhances the system's robustness and effectively mitigates the single-point failure. When the reliability of the management nodes is 0.8, deploying two management nodes increases the average hit ratio by 10% compared to using a single one. This indicates that increasing the number of management nodes is an effective strategy against management-node failures. However, adding more management nodes yields diminishing marginal returns: when their reliability exceeds 0.85, the difference in average hit ratio between two and three management nodes is negligible.
The second approach involves utilizing the distributed hash table (DHT) or its variants [49] to store metadata in a decentralized manner. In such a deployment, the metadata of edge files is distributed among edge storage servers based on the hash values of files, following a consistent hashing algorithm. The file can only be retrieved from the edge if both the server retaining its metadata and the servers storing its referenced chunks are available. The experimental results are depicted in Fig. 12(b), where MEAN-DHT employs a DHT-based approach, while the other three comparisons utilize a management node with varying degrees of reliability (e.g., MEAN-0.9 indicates a management node reliability of 0.9). The effectiveness of the DHT-based approach is highly sensitive to the reliability of storage nodes, with an average hit ratio increase of over 40% as the reliability of storage nodes increases from 0.5 to 0.9. When the reliability of storage nodes is 0.9, the DHT-based approach can yield a slightly higher hit ratio compared to employing a management node with the same reliability. However, MEAN consistently achieves the highest hit ratio when the management node is reliable. The average hit ratio can reach up to 93.44% when storage nodes have a reliability of 0.9. This suggests that implementing a highly reliable management node is an effective strategy for enhancing the performance of MEAN.
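As a rough sketch of this DHT-style deployment (a simplified hash ring rather than the exact variant in [49]), a file's metadata is placed on the server whose position follows the file's hash on the ring, and a hit additionally requires every chunk's holders to intersect the set of alive servers.

import hashlib
from bisect import bisect_right

class MetadataRing:
    """Toy consistent-hashing ring for placing file metadata on storage servers."""
    def __init__(self, server_ids, vnodes=32):
        self.ring = sorted((self._hash(f"{s}#{v}"), s)
                           for s in server_ids for v in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def metadata_server(self, file_id):
        """Server owning the arc that the file's hash falls into."""
        idx = bisect_right(self.keys, self._hash(str(file_id))) % len(self.ring)
        return self.ring[idx][1]

def dht_edge_hit(ring, file_id, file_chunks, chunk_servers, alive):
    """A hit needs the metadata holder to be alive AND every chunk reachable."""
    if ring.metadata_server(file_id) not in alive:
        return False
    return bool(file_chunks) and all(chunk_servers[c] & alive for c in file_chunks)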

VI. CONCLUSION
In this paper, we present MEAN, a deduplication-enabled storage system using unreliable resources at the network edge. MEAN improves the file hit ratio by jointly considering space efficiency and file reliability. Thus, it can effectively reduce file retrieval delays in unreliable environments and alleviate congestion on the backbone network. We provide efficient heuristics based on similarity-aware hierarchical clustering and elaborate our MEAN strategy under three different reliability scenarios. Comprehensive experimental results based on the prototype and real-world datasets demonstrate the superiority of MEAN in terms of file hit ratio, average retrieval throughput, and average retrieval delay.

Pin Lv (Member, IEEE) received the BS degree in software engineering from Northeastern University, Shenyang, in 2006, and the PhD degree in computer science and technology from the National University of Defense Technology, Changsha, in 2012. He is currently with the School of Computer, Electronics and Information, Guangxi University, Nanning, and also with the Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning. His research interests include wireless networks, mobile computing, Internet of Things, etc. He is a member of CCF and ACM.
Bowen Sun received the double bachelor's degree in petroleum engineering and computer science from the China University of Petroleum, Beijing, in 2020. He is currently working towards the MS degree with the College of Computer, NUDT, Changsha. His main research interests include network measurement and in-network computing.