Greedy Iterative and Meta-Heuristic Clustering With Coded Caching and Slepian-Wolf Compression for Correlated Content

Content caching has emerged as an effective approach to combat the increasing strain on current network infrastructure. This method is further improved by combining caching with source coding. However, the hybrid method incurs additional complexity, as the source coding component comes with associated feasibility constraints and decoding costs. This paper presents an approach to balance this complexity against the coding gains by selecting the best-performing subset of files to compress, while the remaining files are left uncoded. This problem is shown to be NP-hard in general and difficult to solve in an iteration-free manner. To this end, two novel approaches are outlined: an iterative solution, which uses the properties of the entropy function to select the most suitable files to compress jointly, and a meta-heuristic version based on the Genetic Algorithm. When compared to an exhaustive search, the proposed solutions are found to be sub-optimal but fall above the 90th percentile of all possible solutions on average. Significantly, the iterative method produces results within one percentile of the meta-heuristic approach, yet it finds a solution 2.31 times faster. The iterative approach has the additional benefit of being able to predict the relative gains when adding more files to a compression group. It can thus terminate early if the estimated gains are less than a chosen threshold.


I. INTRODUCTION
Recently, there has been a relentless increase in the amount of data traffic as the number of Internet users and Internet-connected devices grows. This, together with ever-improving Internet speeds and availability, has put immense strain on the current network infrastructure. Thus, current research is focused on intelligent methods to organise and deliver content without relying on ad hoc network usage. One promising method is content caching [1], where relevant information is downloaded to user devices during off-peak hours based on expected demand. The server is then able to provide the content to consumers at a reduced rate when the network is constrained.
In practical applications, the files being requested are often correlated to one another, since many types of content, such as current news and popular videos, have a high degree of similarity [2]. Consequently, new techniques have been proposed to improve the efficacy of content caching by taking the correlation between files into consideration [3], [4], [5].
One approach to exploiting the correlation is to use source coding methods to compress the information before caching. Slepian-Wolf (SW) coding [6] leverages high degrees of correlation between information sources to compress their data in a distributed manner, without the need for collaborative communication. Although generally studied in the Wireless Sensor Network (WSN) setting, this technique has proven to be a promising solution to reduce the transmission rate in a caching scenario [7], [8].
A major limitation of this type of SW coding is that the decoding of the information is performed jointly, which becomes computationally expensive when multiple files are included. Thus, it is desirable to reduce the complexity of the coding without greatly sacrificing the compression gains. In general, this is referred to as the clustered SW problem [9]. One way to achieve this is to limit the number of files to compress, but selecting the optimal subset of files in terms of compression gains is an NP-hard problem which has not been addressed in the literature to date. As a result, this paper develops two well-performing but sub-optimal methods that find the best-performing subset of files to compress.

A. BACKGROUND AND MOTIVATIONS
Maddah-Ali and Niesen [1] were the first to optimise caching in terms of the global caching gain, which is the total memory available at the user end. They introduced a novel coding technique, called Coded Caching (CC), that intelligently stores parts of all the files on the main server with the end users during the placement phase. When the users' requests are revealed in the delivery phase, the server is able to compress all the requests into a single multicast file based on the knowledge of the files stored previously. On receiving the multicast message, each end user reconstructs their requested file by XOR'ing it with the contents of their local cache. Much work has been done in this field, such as considering the fundamental coding limits in the case where caches are shared between users [10] and the optimal placement of files based on popularity [7], [8]. It has also been found useful in other current research topics, such as Information-Centric Networking (ICN), which replaces the traditional server/user model with information being stored in the network itself [11].
For the specific case where files are correlated with one another, Hassanzadeh et al. describe how the caching bounds can be improved [2]. This is achieved by dividing the files into different subsets. The subsets can be used to compress one another, since information will be repeated owing to the correlation. The compressed files are then stored at the caches and can be recovered after the users reveal their file requests. The compression has the effect that more of the information can be stored at the cache, minimising the size of the multicast message sent by the server during the delivery phase [3]. By simplifying the model, an improved placement scheme was designed in [4]. This was made more systematic by incorporating Gray-Wyner coding into the compression design [5]. However, as the authors note, the number of constraints required to achieve optimal compression grows exponentially with the number of files. As such, they only present the cases where two files are transmitted to k users and three files are transmitted to two users.
Gray-Wyner coding is considered part of the field of Distributed Source Coding (DSC), where multiple independent pieces of information can be compressed at once, and the goal is to reduce bandwidth usage by limiting communication between users. SW coding falls under DSC and is similar to Gray-Wyner coding, with changes in the general structure [12]. In fact, Merikhi and Soleymani use the idea of decoding with side information (a feature of SW coding) in their CC implementation, although it is only used to compress two information sources [7], [8]. SW coding, too, suffers from exponential increases in the number of coding constraints [13]. Furthermore, DSC schemes require jointly decoding the compressed information received, leading to a complexity that grows with the number of files included in the scheme. Wang et al. first proposed a method of dealing with this increase in complexity. They clustered the files into groups, such that the overall compression was maximised while the computational complexity was bounded [9]. They extended their work to increase the security [14] and overall compression [15] of their system. Shu, one of the authors of the above papers, used the same approach and added robustness by electing backup nodes [16] and energy efficiency by correlating the chance of becoming a cluster head to the distance from the next hop [17]. In a similar vein, Yang et al. [13], [18] provide a solution based on Lagrangian multiplier optimisations to organise the sources into a simpler structure, in an effort to reduce the SW decoding complexity. To this end, they simplify the correlation structure by only considering sources that are within a fixed radius of one another. This same simplification is used by Yuen et al. in [19], albeit with a different method to find the optimal results. More recently, Amutha et al. present this problem and use a sailfish meta-heuristic algorithm as a potential solution [20].
As a result, by adapting the clustering solution for SW coding to the CC domain, it is possible to simplify the system model in [5] and achieve the coding bounds for a greater number of files and users without increasing the complexity significantly. However, there are limitations to the current research around clustered SW as well. Firstly, in all the papers cited above, the solution involves calculating the performance of every combination of sources not yet selected. As the number of sources increases, this approach becomes computationally infeasible. Yang et al. reduce this complexity somewhat by disregarding correlations below a certain threshold. They also simplify the entropy calculation by modelling the correlation as a Gaussian distribution. Nevertheless, there are two issues with this approach. Firstly, it is possible that, even with this simplification, there will still be many sources within the sensing radius if the source distribution is dense. Thus the original problem of complexity returns, since the method for searching for the optimal grouping is not fundamentally changed. Secondly, using a fixed sensing radius and correlation model might grossly oversimplify the problem, since it is based on a simple spatial distance metric. This does not take into account correlation-specific metrics and will thus not be helpful in the CC domain, where the information sources are files to be compressed, as opposed to the WSN setting, where the information sources are nodes in a network.

B. CONTRIBUTIONS
Motivated by the gaps and shortcomings identified above, the main contributions of this paper are to:
1) Adapt the current work on SW coding in a WSN environment to the CC with correlated sources scenario. This involves using the clustering optimisations for SW to simplify the system model given in CC for a Gray-Wyner network and thus reduce the complexity of the coding and decoding.
2) Incorporate a different model of the correlation between information sources into evaluating the entropy performance of the system. We choose to use the summation of Mutual Information Areas (MIAs) instead of Gaussian random variables, as these areas are independent and can be tailored to a variety of cases, including files in a CC setting.
3) Create two novel solutions to the clustered SW problem with CC considerations. This is done without relying on the simplifications used in the literature to date.
The comparison of our work to other literature across the different fields is presented in Table 1.

C. LAYOUT
The rest of this paper is structured as follows: Section II describes the system model and optimisation problems, while Sections III and IV outline the two approaches. Section V analyses the complexity and optimality of the various solutions and Section VI presents and compares the simulation results. Section VII presents a brief discussion and outlines future work directions. Finally, Section VIII concludes this paper.

II. SYSTEM MODEL AND PROBLEM FORMULATION
The system consists of two primary components: CC and SW coding. The former ensures that the bandwidth of the server is minimised during peak hours, while the latter seeks to reduce the total amount of information that needs to be sent by the server to the users by compressing the files beforehand. A general outline of the system model is shown in Fig. 1. The following two subsections provide more detailed modelling for each subsystem.

A. CODED CACHING MODEL
In [7], [8], Merikhi and Soleymani present a CC system model in which users can receive information from the server or from shared remote caches. In contrast, this paper focuses on the single-server case with local caches, where Z users connect to a single base station over an error-free broadcast link. The base station contains a library of files represented by the set N. Each element of N is modelled as an information source X_i, i ∈ {1, 2, ..., |N|}. Without loss of generality, each file X_i produces F binary symbols which are i.i.d. and ergodic. Accordingly, each file has an entropy H(X_i) = F bits, ∀X_i ∈ N. However, it is assumed that the files are correlated to one another according to the distribution p(x_1, x_2, ..., x_{|N|}). In addition, each user has at its disposal a local cache of size M files, or MF bits. We denote Z_i as the contents of user i's cache.
The CC system operates in two distinct phases. In the placement phase, the base station intelligently fills the users' caches with files from its library during off-peak hours. Thus, the transmission rate is not constrained in this phase, only the size of the memory available at the user end.
In the delivery phase, the users reveal their demands, modelled here as a vector d := {d_1, d_2, ..., d_Z}, where each d_i is an index corresponding to a request from user i for file X_{d_i}. The base station attempts to fulfil the file requests of the users by broadcasting a compressed version of the files, based on the placement in the caches in the previous phase.
As in [5], the objective of the caching scheme is evaluated according to the minimum multicast rate necessary to fulfil the worst-case demand:

$R_m = \max_{d \in D} \ell(Y_d)$  (1)

where ℓ(·) is the length of the broadcast codeword Y_d for demand d, and D is the set of all possible demand vectors. Another evaluation metric is the average multicast rate, defined as:

$\bar{R}_m = \frac{1}{|D|} \sum_{d \in D} \ell(Y_d)$  (2)

It is based on the average broadcast codeword length over all demands.

B. SW CODING MODEL
The correlation between the sources is modelled as the MIAs for each unique subset of N. Thus, there are a total of 2^{|N|} − 1 areas. Naturally, the joint entropy of all the files satisfies H(N) ≤ |N|F bits. Nevertheless, the correlation values are merely statistical and do not necessarily describe the actual correlation between the contents of the files.
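To make the MIA model concrete, the following is a minimal Python sketch of one way such a model can be represented, assuming that the entropy of any group of sources is the sum of all areas that touch the group. The area values, the dictionary layout and the helper names (subset_entropy, conditional_entropy) are our own illustration and are not taken from the paper.

# Illustrative MIA model for N = {1, 2, 3}. Each non-empty subset of sources
# owns a non-negative area (in bits); the values are made up for this sketch
# and are chosen so that every individual source has entropy F = 60 bits.
mia = {
    frozenset({1}): 40, frozenset({2}): 35, frozenset({3}): 40,
    frozenset({1, 2}): 10, frozenset({1, 3}): 5, frozenset({2, 3}): 10,
    frozenset({1, 2, 3}): 5,
}

def subset_entropy(group, mia):
    """H(group): the sum of every area that overlaps the group of sources."""
    group = set(group)
    return sum(area for subset, area in mia.items() if subset & group)

def conditional_entropy(group, given, mia):
    """H(group | given) = H(group plus given) - H(given)."""
    return subset_entropy(set(group) | set(given), mia) - subset_entropy(given, mia)

print(subset_entropy({1}, mia))        # 60 = F for each source
print(subset_entropy({1, 2, 3}, mia))  # 145 <= |N| * F = 180, as required above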
Since the files are correlated, the base station is able to use SW coding to compress the files that are stored in the caches, although the exact contents of the files are unknown. This has the effect of storing more content from the files in the users' caches, meaning that ℓ(Y_d) is reduced. However, there are two primary restrictions on this method.
Firstly, in terms of the compression itself, there are bounds given by Cover [21] for |N| sources. In total, there are 2^{|N|} − 1 bounds, one for each combination of sources. For example, for 3 sources {X_1, X_2, X_3}, the 7 coding bounds are:

$R_1 \ge H(X_1 \mid X_2, X_3)$
$R_2 \ge H(X_2 \mid X_1, X_3)$
$R_3 \ge H(X_3 \mid X_1, X_2)$
$R_1 + R_2 \ge H(X_1, X_2 \mid X_3)$
$R_1 + R_3 \ge H(X_1, X_3 \mid X_2)$
$R_2 + R_3 \ge H(X_2, X_3 \mid X_1)$
$R_1 + R_2 + R_3 \ge H(X_1, X_2, X_3)$

Significantly, the bound on the total coding rate, and therefore the maximum compression of the system as a whole, is given by the joint entropy, which in general can be expanded as:

$H(N) = H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_2, X_1) + \dots + H(X_{|N|} \mid X_{|N|-1}, \dots, X_1)$  (4)

Equation (4) is obtained by repeatedly applying the chain rule for entropy.
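Building on the MIA helpers sketched in the previous subsection, the full set of 2^{|N|} − 1 bounds can be enumerated mechanically for a small example. This only illustrates the structure of the constraints under our assumed toy areas, not the paper's numerical values.

from itertools import combinations

def sw_bounds(sources, mia):
    """For every non-empty subset S of sources, the sum of the rates in S must
    be at least H(S | sources outside S); returns all 2^|sources| - 1 bounds."""
    sources = set(sources)
    bounds = {}
    for size in range(1, len(sources) + 1):
        for subset in combinations(sorted(sources), size):
            rest = sources - set(subset)
            bounds[subset] = conditional_entropy(subset, rest, mia)
    return bounds

for subset, bound in sw_bounds({1, 2, 3}, mia).items():
    print(f"sum of rates over {subset} >= {bound} bits")
# The last bound printed is the sum-rate constraint, equal to the joint
# entropy H(N) in (4).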
Secondly, this bound is only achievable if all sources are decoded jointly (although the encoding is disjoint). This joint decode is computationally expensive, and its cost is governed by the number of sources involved in the coding scheme.
Thus, the general optimisation problem is to decrease the complexity by removing sources from the scheme while minimising the impact on the achievable compression for the other files. As a result, this paper considers the system shown in Fig. 1, which is adapted from [5]. In that paper, all the files are compressed before they are transmitted to the caches. The goal of our system is to partition the library into two groups, one of which will be compressed according to a DSC method such as SW with Matrix Partitioning [22], denoted as N_c. The other group will remain uncoded and is represented by N_u. The sets are disjoint, meaning that N_c ∪ N_u = N and N_c ∩ N_u = ∅. This new hybrid library can then be distributed amongst the Z users using a cache encoder optimised for files of unequal lengths (such as [23]). Unlike [4], which considers that files are divided into a finite number of blocks, this paper allows for compressed files of any size.
We now turn to more formally defining the objectives of the system.

C. OBJECTIVES
The main goal of the hybrid system is to reduce the complexity of the encoding and decoding of the compression scheme without sacrificing too much of the compression rate of the system as a whole. There are two approaches to achieving this.

In the first, the complexity is fixed by setting the maximum number of sources to compress. Hence, let γ = |N_u| be the number of sources that should not be compressed, chosen such that the number of compressed sources |N_c| = |N| − γ achieves a reasonable decoding complexity. Then, the objective is to choose the subset N_u from N that maximises the reduction in information for the coded sources N \ N_u = N_c, or equivalently minimises their joint entropy. This is formulated as follows:

$N_u^* = \arg\min_{N_u \subset N,\ |N_u| = \gamma} H(N \setminus N_u)$  (5)

Another approach is to bound the entropy of the compressed group of sources, effectively setting the compression performance of this group. Then, the objective is to find the maximum number of sources to put in N_c without exceeding the entropy bound. As a result, let ζ be an entropy value in the range 0 < ζ < H(N). The goal is to minimise the number of sources in N_u while keeping the entropy of the compressed group to H(N_c) ≤ ζ. In this instance, the objective function is defined as:

$N_u^* = \arg\min_{N_u \subseteq N} |N_u| \quad \text{s.t.} \quad H(N \setminus N_u) \le \zeta$  (7)

These optimisation problems are similar to the Minimum Weight Set Covering (MWSC) problem, where each subset has a weight attached to it and the goal is to choose subsets that cover all members of the set while minimising the total weight. This problem is known to be NP-hard [24].
Another similar optimisation problem is the 0-1 Knapsack (0-1K) problem, in which each member of a set has a weight and a value. The objective is to choose the best-performing subset of members that maximises the total value while not exceeding a certain total weight. This problem too is NP-hard [25].
The optimisation problems in this paper are similar to the MWSC and 0-1K problems, but in those problems the number of combinations in the search space for a set of size |N| is 2^{|N|}, whereas in our scenario it is $\binom{|N|}{\gamma}$. However, in our case, the weights (entropies) of each subset are not known beforehand and must be calculated, unlike in the MWSC and 0-1K problems. Furthermore, in our scenario, each weight calculation requires summing over 2^γ − 1 areas. As a result, when γ is in the region of |N|/2, or if γ is large, our optimisation problems become NP-hard.
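As a quick illustration of the search-space sizes involved (the specific numbers below are ours, using the |N| = 18 setting from Section VI):

from math import comb

N, gamma = 18, 9
print(2 ** N)          # 262144: subsets considered in MWSC/0-1K-style problems
print(comb(N, gamma))  # 48620: subsets of size gamma in our constrained problem
# comb(N, gamma) peaks when gamma is near N/2, which is exactly the regime in
# which an exhaustive search over the groupings becomes impractical.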
Accordingly, it is difficult to find an optimal solution in polynomial time. In the next sections, two sub-optimal approaches are discussed.

III. GREEDY ITERATIVE SELECTION PROCEDURE
The basic approach to solving the optimisation problem in (5) is to iteratively select the most suitable sources until a subset of size γ is reached. Further examination of the expansion of the total entropy in (4) reveals that the entropy of the set can be expressed in individual terms, where each term refers to only one source conditioned on other sources. This means that choosing the source X_i for each term such that

$H(X_i \mid X_{i-1}, X_{i-2}, \dots, X_1) \ge H(X_j \mid X_{i-1}, X_{i-2}, \dots, X_1), \quad \forall j \in \{i+1, \dots, |N|\}$

should guarantee a maximisation of the entropy expression at that point. The only exception is the first term since, as mentioned in Section II-A, the entropies of all sources are set to be the same. In this case, it is necessary to choose the source based on a different criterion. Notice that the final term in (4) is

$H(X_{|N|} \mid X_{|N|-1}, \dots, X_1) = H(X_{|N|}) - I(X_{|N|}; X_1, \dots, X_{|N|-1}).$

Thus, choosing the source that maximises this entropy term is equivalent to selecting the node that is least correlated with the other sources.
These observations imply that, by continually selecting the largest term from the available set of files to add to N_u, the entropy of the set N_c = N \ N_u is minimised. As a result, the following selection procedure is proposed.

Lemma 1 (Source Selection Procedure): Let the sources in N be indexed according to the following conventions. Choose

$X_1 = \arg\max_{X \in N} H(X \mid N \setminus \{X\})$  (8)

and, for k ≥ 2,

$X_k = \arg\max_{X \in N \setminus \{X_1, \dots, X_{k-1}\}} H(X \mid X_{k-1}, \dots, X_1).$  (9)

It is possible for there to be multiple options for X_k, in which case X_k should be chosen arbitrarily.
Lemma 2: If the sources are organised as outlined in Lemma 1, then it is guaranteed that

$H(X_k \mid X_{k-1}, \dots, X_1) \ge H(X_{k+1} \mid X_k, \dots, X_1), \quad \forall k \in \{1, \dots, |N|-1\}.$

Proof: From (9) it is known that

$H(X_k \mid X_{k-1}, \dots, X_1) \ge H(X_{k+1} \mid X_{k-1}, \dots, X_1),$

since X_{k+1} was also a candidate when X_k was selected. Furthermore,

$H(X_{k+1} \mid X_{k-1}, \dots, X_1) \ge H(X_{k+1} \mid X_k, X_{k-1}, \dots, X_1),$

since conditioning reduces entropy. □

Using this selection procedure thus ensures that the terms in the expansion (4) are in descending order.

Theorem 1: Let the sources be indexed according to Lemma 1. Then, at any point k in the selection process,

$H(X_{k+1} \mid X_k, \dots, X_1) \ge H(X_{k+2} \mid X_{k+1}, X_k, \dots, X_1) \ge \dots \ge H(X_{|N|} \mid X_{|N|-1}, \dots, X_1),$  (16)

and consequently the entropy contribution of the sources not yet selected is bounded as

$H(X_{k+1}, \dots, X_{|N|} \mid X_k, \dots, X_1) \le (|N| - k)\, H(X_{k+1} \mid X_k, \dots, X_1).$

Proof: The chain of inequalities in (16) follows by repeatedly applying Lemma 2. Together with this, we have, by the chain rule for entropy,

$H(X_{k+1}, \dots, X_{|N|} \mid X_k, \dots, X_1) = \sum_{j=k+1}^{|N|} H(X_j \mid X_{j-1}, \dots, X_1).$

Finally, bounding each of the |N| − k terms in the sum by the largest one yields the result. □

Theorem 1 can thus be used as a stopping condition to satisfy (7). If, while choosing the sources, the entropy of the selected source is less than ζ, then the rest of the sources' entropy contribution can be bounded and the selection process will terminate with k sources.
The above derivations are combined to produce the Greedy Iterative Single Group Entropy Minimisation (GISGEM) selection procedure, outlined in Algorithm 1. It is able to generate a potential solution to the optimisation problems in (5) and (7) by maximising the entropy of the source removed at each step.
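Algorithm 1 itself is not reproduced in this text, so the following is a minimal Python sketch of the greedy selection described by Lemma 1, reusing the MIA helpers from Section II-B. The function name, the handling of the first pick and the optional early-stopping threshold are our own reading of the procedure, not the paper's code.

def gisgem(sources, mia, gamma, zeta=None):
    """Greedily move up to gamma sources into the uncoded set N_u.

    The first pick maximises H(X | N \ {X}) (the least correlated source);
    every later pick maximises H(X | previously selected sources), as in
    Lemma 1.  If zeta is given, selection stops once the selected term drops
    below it, in the spirit of the Theorem 1 stopping condition.
    """
    remaining = set(sources)
    n_u = []  # the uncoded sources, in selection order
    while remaining and len(n_u) < gamma:
        if not n_u:
            score = lambda x: conditional_entropy({x}, remaining - {x}, mia)
        else:
            score = lambda x: conditional_entropy({x}, n_u, mia)
        best = max(remaining, key=score)
        if zeta is not None and score(best) < zeta:
            break
        n_u.append(best)
        remaining.remove(best)
    return n_u, remaining  # remaining = N_c, the sources left to compress

n_u, n_c = gisgem({1, 2, 3}, mia, gamma=1)
print(n_u, n_c, subset_entropy(n_c, mia))  # entropy of the compressed group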

IV. META-HEURISTIC APPROACH
The Genetic Algorithm (GA) is a meta-heuristic algorithm whose efficacy in combinatorial problems is well known [26]. Every valid combination is represented by a genome sequence, which sets the variables in the optimisation problem. For the single group entropy minimisation problem, the genome structure is defined as follows: each position in the genome represents a source in the set N and takes the Boolean value 1 or 0 depending on whether the source is in the set N_u or not. This means that the genomes are of length |N|, with the constraint that the Hamming weight must be equal to γ.
A population of random genomes is created, with rules defined for populating the next generation based on genome crossover and mutation. The crossover function selects features from both parents, based on a random crossover point. In our approach, the crossover function needs to produce child genomes that conform to the weight constraint mentioned above. As a result, the crossover function is changed to the following: given two genome sets G_1 and G_2 (defined as the sources in N_u indicated by the genome sequences g_1 and g_2), the child genome set G_3 is constructed by randomly choosing γ sources from G_1 ∪ G_2.
The mutation function randomly flips a bit in the genome, ensuring that the genome pool does not become too small. Here, the mutation is updated to swap a random number of value pairs in the genome, since simply flipping a single bit would violate the Hamming weight constraint. The rest of the algorithm, such as parent selection, does not require any changes. A summary of this approach can be found in Algorithm 2.
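The following is a minimal sketch of the weight-preserving crossover and swap mutation described above, with a genome represented as a 0/1 list of length |N|. The helper names and the uniform random choices are our own illustration.

import random

def crossover(g1, g2, gamma):
    """Child selects gamma sources at random from the union of the parents'
    selected sources, so its Hamming weight stays equal to gamma."""
    union = {i for i, bit in enumerate(g1) if bit} | {i for i, bit in enumerate(g2) if bit}
    child = [0] * len(g1)
    for i in random.sample(sorted(union), gamma):
        child[i] = 1
    return child

def mutate(genome):
    """Swap a random number of (selected, unselected) position pairs, so that
    the Hamming weight constraint is never violated."""
    genome = genome[:]
    ones = [i for i, bit in enumerate(genome) if bit]
    zeros = [i for i, bit in enumerate(genome) if not bit]
    if not ones or not zeros:
        return genome
    swaps = random.randint(1, min(len(ones), len(zeros)))
    for i, j in zip(random.sample(ones, swaps), random.sample(zeros, swaps)):
        genome[i], genome[j] = 0, 1
    return genome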

V. ANALYSIS
The methods outlined above are now analysed in terms of algorithm complexity and whether they produce objectively optimal solutions as compared to a Brute Force (BF) search.

A. COMPLEXITY
The BF approach to solving the entropy minimisation optimisation requires that every combination of N_u is determined and evaluated, and the grouping that provides the least impact on the compression gains is chosen. This means that the complexity is in the order of $O\!\left(\binom{|N|}{\gamma}\right)$. In contrast to this, the GISGEM selection procedure has a much lower complexity. In the ith step, there are |N| − i + 1 terms that are calculated. Thus, the total number of calculations required for γ steps is

$\sum_{i=1}^{\gamma} (|N| - i + 1) = \gamma|N| - \frac{\gamma(\gamma - 1)}{2}.$

There is the additional complexity of ranking the terms in each step; however, this is negligible when compared to calculating the entropy. As a result, the complexity of this selection procedure is O(γ|N|). The complexity of the GA varies, and is dependent on the size of the population as well as the number of generations required until the algorithm converges.
In the literature, the authors in [9] propose a greedy algorithm that requires ranking the power set of N. However, their algorithm finds the minimum entropy disjoint grouping for all sources. Nevertheless, when determining the best grouping of size γ, this method has a complexity in the same order as the BF method. Thus, the GISGEM selection procedure dramatically reduces the complexity as compared to the BF and literature methods when γ is in the region of |N|/2.

B. OPTIMALITY
Although less complex than the BF approach, the following lemmas show that using GISGEM does not provide an optimal result with regards to the optimisations in (5) and (7), respectively.

Lemma 3: The GISGEM selection procedure is not optimal in terms of the optimisation in (5).
Proof: Since the GISGEM selection procedure is iterative, it begins with N_u = ∅ and adds a single source at a time. Without any stopping conditions, this will result in N_u = N. However, every possible order of choosing sources to put into N_u must pass through these same start and end states. Thus, a single selection order cannot produce an optimal result for every arbitrary 0 ≤ γ ≤ |N|, since it cannot be guaranteed to be optimal at every intermediate point. □

Lemma 4: GISGEM is not optimal in terms of the optimisation in (7).
Proof: Theorem 1 provides a stopping condition for the GISGEM algorithm. Nevertheless, as shown in Lemma 3, this result is not necessarily optimal. Thus, it is possible that another selection of sources has a better-performing N_u and, even after removing one or more of its worst-performing sources, would still perform better than GISGEM. □

Thus, theoretically, the GISGEM algorithm should produce sub-optimal results but with lower time complexity than the BF and GA methods. The following section details the numerical results obtained for the different approaches.

VI. RESULTS
In the following subsection, an illustrative example is presented, comparing the performance of the different coding schemes, namely the Coded Caching (CC) [1] and CC with SW (SW/CC) [5] approaches from the literature, and the hybrid method with Reduced Complexity (SW/CC-RC) proposed in this paper (depicted in Fig. 1). The methods used to reduce the complexity in the SW/CC-RC system are compared in the subsequent subsection.

A. NUMERICAL EXAMPLE
Consider three files with the correlation model shown in Fig. 2, with each file having an entropy of 60 bits. Furthermore, let the number of users Z = 2, each with cache size MF = 90 bits. For the sake of clarity, we denote X_A^{i:j} as a sub-file of X_A consisting of bits i to j inclusive, where i < j and files begin with bit 1. As a result, the length of this sub-file is ℓ(X_A^{i:j}) = j − i + 1 bits. Furthermore, let {X_A^{i:j}; X_B^{k:l}} (with a semi-colon) denote the concatenation of sub-files from X_A and X_B.
In the classic CC method, where correlation is ignored, the ideal placement is to divide each file in half and place each half at a different user. When the users' demands are revealed, the server XORs together the parts of the requested files not currently stored by their requesting users and transmits the joint codeword. For instance, let the demand d = {A, B}. Since user Z_1 has already stored X_A^{1:30} in its cache Z_1 and user Z_2 has X_B^{31:60} in Z_2, the codeword transmitted by the server is Y_d = X_A^{31:60} ⊕ X_B^{1:30}. At the receiving end, each user XORs Y_d with the cached half of the file it did not request to decode the rest of its requested file (e.g. Z_1 calculates Y_d ⊕ X_B^{1:30} = X_A^{31:60}). Under these conditions, it is readily verifiable that the peak and average demand rates are R_m = \bar{R}_m = 30 bits.
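The delivery step above can be checked with a few lines of Python, using random bits in place of the actual file contents; the variable names mirror the example and are purely illustrative.

import random

F = 60
X_A = [random.randint(0, 1) for _ in range(F)]
X_B = [random.randint(0, 1) for _ in range(F)]
xor = lambda u, v: [a ^ b for a, b in zip(u, v)]

# Placement: Z_1 caches the first half of every file, Z_2 the second half.
Z1 = {"A": X_A[:30], "B": X_B[:30]}
Z2 = {"A": X_A[30:], "B": X_B[30:]}

# Delivery for d = {A, B}: a single 30-bit multicast codeword.
Y_d = xor(X_A[30:], X_B[:30])

assert xor(Y_d, Z1["B"]) == X_A[30:]  # Z_1 recovers the missing half of X_A
assert xor(Y_d, Z2["A"]) == X_B[:30]  # Z_2 recovers the missing half of X_B
print("peak rate:", len(Y_d), "bits")  # 30 bits, as stated above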
The second approach, SW/CC, uses SW coding before performing the CC. To begin, it is necessary to calculate the bounds on the coding rates when compressing the files. In this example, there are 7 conditions to be met:

$R_A \ge H(X_A \mid X_B, X_C)$  (22)
$R_B \ge H(X_B \mid X_A, X_C)$  (23)
$R_C \ge H(X_C \mid X_A, X_B)$  (24)
$R_A + R_B \ge H(X_A, X_B \mid X_C)$  (25)
$R_A + R_C \ge H(X_A, X_C \mid X_B)$  (26)
$R_B + R_C \ge H(X_B, X_C \mid X_A)$  (27)
$R_A + R_B + R_C \ge H(X_A, X_B, X_C) = 102 \text{ bits}$  (28)

The most restrictive of these is (28). Thus, to satisfy all conditions, it is sufficient to set all coding rates to 102/3 = 34 bits. Let the files compressed at these rates be $\tilde{X}_A$, $\tilde{X}_B$ and $\tilde{X}_C$ (this can be achieved using the Matrix Partitioning method for SW coding [22]). Using the same cache size as above, we are now able to store 30 bits from each of these files, corresponding to 88% of each file as opposed to 50% in the CC method. In this case, both caches receive $\tilde{X}_A^{1:30}$, $\tilde{X}_B^{1:30}$ and $\tilde{X}_C^{1:30}$. The remaining 12 bits (4 from each file) are transmitted regardless of the demand vector. On the receivers' end, the new information is used to perform a joint decode to losslessly obtain X_A, X_B and X_C. As a result, the performance is R_m = \bar{R}_m = 12 bits. However, this comes at an increased complexity cost, owing to the number of constraints that need to be met, as well as the joint decoding complexity.
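A quick arithmetic check of the SW/CC numbers above (the joint entropy of 102 bits is the value appearing in the sum-rate bound (28)):

joint_entropy = 102                   # H(X_A, X_B, X_C) from bound (28)
rate_per_file = joint_entropy // 3    # 34 bits per compressed file
cached_per_file = 90 // 3             # 30 bits of each compressed file per cache
print(round(cached_per_file / rate_per_file, 2))  # 0.88 -> 88% of each file cached
print(3 * (rate_per_file - cached_per_file))      # 12 bits sent regardless of d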
Finally, the SW/CC-RC method is demonstrated. Before performing the SW coding, the GISGEM algorithm is applied to the files to determine which ones to remove. Table 2 lists all the iterations used to index the files. In the first step, the excess entropies for each file (i.e. the entropy of each file conditioned on all other files) are calculated and the maximum is chosen to be X_1. The next iterations exclude all previously chosen sources and the conditional entropies are calculated, with the maximum chosen in each successive stage. In this example, the selection procedure is always optimal, since the first and last steps are guaranteed to be optimal and there are only 3 iterations. If we set γ = 1, for example, the algorithm will suggest that X_B is removed. Thus, the library is partitioned into two subsets, N_c = {X_A, X_C} and N_u = {X_B}. Consequently, the SW coding method from the previous SW/CC example is only applied to the two files in N_c. These files have 3 conditions for lossless SW coding:

$R_A \ge H(X_A \mid X_C)$  (29)
$R_C \ge H(X_C \mid X_A)$  (30)
$R_A + R_C \ge H(X_A, X_C)$  (31)

Notice that the bound (31) is different to (26), since X_B is treated as independent of X_A and X_C in this case. Setting R_A = R_C = 41 bits satisfies all requirements and the files are compressed accordingly. Thus, the library now consists of {$\tilde{X}_A$, X_B, $\tilde{X}_C$}. In the next phase, the cache encoder stores 28 bits from each of the two compressed files and 34 bits from the uncompressed file. The caches are filled this way to balance out the peak multicast rate. The results of this phase are shown in Fig. 3, with Z_1 = {$\tilde{X}_A^{1:28}$, X_B^{1:34}, $\tilde{X}_C^{1:28}$} and Z_2 = {$\tilde{X}_A^{14:41}$, X_B^{27:60}, $\tilde{X}_C^{14:41}$}. Using this placement, all files can be decoded at the user end regardless of d, as shown in Table 3. In terms of performance, Table 3 also shows that R_m = \bar{R}_m = 26 bits.

In summary, Table 4 shows what is stored at each cache for the different approaches. Although the SW/CC method has the lowest average and peak rates, it comes at the cost of higher complexity in designing the coding scheme as well as the decoding algorithm. On the other hand, the SW/CC-RC method sacrifices some of the compression gains to achieve reduced complexity in terms of the number of equations, the decoding algorithm and the number of demand permutations in which SW decoding is needed at each cache (6 out of 12 permutations, compared to 12 for the SW/CC method). Nevertheless, the average rate is still reduced as compared to the regular CC approach.

B. SIMULATION RESULTS
Three methods have been outlined in this paper to determine the sources to remove in the SW/CC-RC approach. These are the BF, GA and GISGEM methods. The literature does not deal directly with the optimisation problem presented in this paper, and the current approaches (when ignoring their simplifications) are equivalent to the BF method.
To perform the comparison, 18 sources were used, where each method's grouping was found for an increasing number of sources in N_u. Fig. 4 shows the entropy obtained for each method when increasing the size of the selected group. Each of the methods is compared to the worst-performing configuration (found using the BF method). The results confirm the sub-optimal performance predicted in Lemma 3, as it is found that, in the range of best- to worst-performing results, the GISGEM results fall in the 91st percentile on average, with a minimum value falling in the 82nd percentile. The GA's results are in the 92nd percentile on average, with the lowest value in the 76th percentile. As compared to the BF method, both GISGEM and the GA are able to find results close to the optimal one. However, since the GA is not constrained to choosing the same grouping as in the previous iteration, it is sometimes able to find a better result than GISGEM. Fig. 4 also shows the correctness of Theorem 1, since the distance between the most and least optimal group selections decreases as more sources are selected. This highlights another advantage of GISGEM, as it is able to terminate earlier if it detects that the current number of sources is sufficient to achieve the objective in (5).
The time taken by each method to find its solution is given in Fig. 5. The GA is, on average, 4.42 times faster than the BF exhaustive search, while GISGEM is 10.20 times faster, with a runtime that increases linearly with respect to γ. These practical results conform well to the theoretical complexity predictions in Section V-A.
In another simulation, the GISGEM and GA methods were tested, this time in a system with |N| = 22. At this number of nodes, the BF method's complexity becomes prohibitively large. In Fig. 6, the difference between the total entropies of the uncoded groups N_u for the GISGEM and GA algorithms is plotted. It shows that, in this run, the GISGEM method outperforms the GA approach for small γ. For larger values of γ the two approaches are closer, with the GA approach sometimes besting the GISGEM one. Nevertheless, as depicted in Fig. 7, the time complexity of GISGEM is consistently lower than the moving average of the GA.

C. GA PARAMETER TUNING
The GA has different parameters that can be tuned to obtain better results. As discussed in [27], these variables are critical to changing the diversification (breadth of the search area) and intensification (refinement of the current results) of the algorithm. They also found that there is not necessarily a single parameter that controls each type of behaviour of the algorithm. In order to examine the effect of each parameter in our scenario, the GA was run with parameters in the ranges shown in Table 5. To explain the parameters: SG refers to the number of iterations for which the elite value obtained remains the same, after which the algorithm is considered converged. The Population Size is the number of configurations tested in each generation. The value α_e is the number of members of P that survive to the next generation, expressed as a percentage of |P|. The Crossover Percentage α_c is the number of children produced using the gene crossover method, expressed as a percentage of |P| − α_e. The number of mutation children is not shown here, as it is automatically set as the complement of α_c.

The GA was run for |N| = 18 and γ ∈ {7, 8, ..., 11} for every combination of parameters. Figures 8-10 show the resulting effects, on average, of each parameter on the best result obtained, the time taken to converge, and the number of iterations. These results are also summarised in Table 5. It is clear from the results that the SG has the biggest impact on both the quality of results and the time complexity. The trade-off between the two is found to be directly proportional. Consequently, an SG of 15 is chosen to slightly favour the quality of results over time complexity. Population size has a medium impact on quality but a small impact on time complexity, while the number of iterations is barely affected. This is because increasing the population size increases the number of configurations tested per generation. For these reasons, the maximum population size of |P| = 20 is chosen. The Elite Percentage has a minimal impact on the quality of results but a small impact on time complexity. However, the relationship is found not to be linear, with α_e = 20% producing the best time relative to the quality of results. Thus, this value is chosen for the final testing. Finally, the balance between α_c and α_m corresponds to a small impact on quality but a minimal impact on time. It is found that α_c = 60% maximises the best results obtained, so this is the value selected. One interesting result is that a change in γ had a small effect on time and a minimal effect on the number of iterations taken to converge. This correlates with the head-to-head testing conducted in Section VI-B, where the moving average of the GA did not change much with a change in γ.

VII. DISCUSSION AND FUTURE WORK
Although the GA is suited to combinatorial problems in general, it is possible for a different meta-heuristic algorithm (such as Particle Swarm Optimisation or the Binary Bat Algorithm) to perform better when considering the SW/CC-RC scenario. Similarly, although the parameters were tuned for this problem, tuning the crossover and mutation functions might produce better results. Although this is a possible direction for future work, it is noted that this paper uses the GA mainly as a benchmark with which to evaluate the performance of the GISGEM method and not as a stand-alone solution.
With regards to the GISGEM approach, the current design uses a 'no regret' scheme, where sources selected previously cannot be removed in a later iteration. However, a better result could be achieved by using a 'look ahead' approach, which is a potential avenue for future work.
In addition, further work is necessary to generalise the method to solve the optimal clustering for all sources. In contrast to the single compressed group presented in this paper, the general grouping case allows for many groups of nodes, all of which will be compressed. In the single group scenario, the size of the search space is given by $\binom{|N|}{\gamma}$. However, for the multiple grouping scenario, an expression for the size of the search space still needs to be derived. Furthermore, the objective functions need to be changed to reflect the new considerations, where the entire library can be grouped and the sum of the entropies of the groups needs to be minimised. It is also unknown how the approaches proposed in this paper will perform under the new considerations. Finally, it is surmised that the generalised scenario will result in an increase in compression with a commensurate increase in complexity. However, it is necessary to compare this trade-off with the current results related to the system design in this paper.

VIII. CONCLUSION
A greedy iterative selection procedure and a meta-heuristic approach are proposed as potential solutions to the clustered SW problem in the context of caching correlated information. Files are grouped together such that the overall decoding complexity is reduced with minimal impact on the compression gains. The iterative method is based on the inherent properties of entropy, and it is able to find a close-to-optimal result with lower time complexity than the BF and meta-heuristic approaches. There is the additional benefit that the algorithm can bound the entropy gains at each iteration and compare them to the relative complexity of the system, allowing it to terminate early if necessary. It is found that these methods are able to successfully reduce the complexity of a SW/CC system while not sacrificing too much of the compression gains.


Algorithm 2 Genetic Algorithm Approach
P ← random population of genomes
Calculate and rank the performance of each genome in P
while not converged do
    Choose α_c|P| random pairs of parent genomes for crossover
    for each parent genome pair {g_1, g_2} do
        G_child ← γ random sources from G_1 ∪ G_2    ▷ Crossover
    end for
    Choose α_m|P| random genomes for mutation
    for each genome g_i do
        g_child ← g_i
        Swap a random number of pairs in g_child    ▷ Mutation
    end for
    P_new ← |P|(1 − α_c − α_m) best genomes from P
    P_new ← P_new ∪ crossover children
    P_new ← P_new ∪ mutation children
    P ← P_new
    Calculate and rank each genome in P
end while

FIGURE 2. Example of a correlation diagram for three sources.


FIGURE 4. The performance of each selection method as compared to the least optimal combination for |N| = 18.
FIGURE 5. The time taken by each selection method to find its solution for |N| = 18.

FIGURE 6. The difference between the total entropies of N_u produced by the GISGEM and GA algorithms for |N| = 22.

FIGURE 7. The time complexity of the GISGEM and GA selection methods for |N| = 22.


TABLE 5. Different parameter settings for GA testing and their impact on the quality of results and complexity.

FIGURE 8. The effect of different GA configurations on performance.

FIGURE 9. The effect of different GA configurations on convergence: time.

FIGURE 10. The effect of different GA configurations on convergence: iterations.

TABLE 1. Comparison of current works and our paper.

TABLE 2. Indexing of sources using GISGEM.

TABLE 3. Encoding and decoding procedure for the SW/CC-RC system.

TABLE 4. Cache contents for different coding methods.