A Review of Population-Based Metaheuristics for Large-Scale Black-Box Global Optimization—Part I

Scalability of optimization algorithms is a major challenge in coping with the ever-growing size of optimization problems in a wide range of application areas, from high-dimensional machine learning to complex large-scale engineering problems. The field of large-scale global optimization is concerned with improving the scalability of global optimization algorithms, particularly population-based metaheuristics. Such metaheuristics have been successfully applied to continuous, discrete, or combinatorial problems ranging from several thousand dimensions to billions of decision variables. In this two-part survey, we review recent studies in the field of large-scale black-box global optimization to help researchers and practitioners gain a bird's-eye view of the field, learn about its major trends, and become familiar with the state-of-the-art algorithms. Part I of the series covers two major algorithmic approaches to large-scale global optimization: 1) problem decomposition and 2) memetic algorithms. Part II of the series covers a range of other algorithmic approaches to large-scale global optimization, describes a wide range of problem areas, and finally touches upon the pitfalls and challenges of current research and identifies several potential areas for future research.


I. INTRODUCTION
THE CURSE of dimensionality is "a malediction that has plagued the scientists from the earliest days" [1], and taming it has been at the heart of many research efforts in computational sciences ranging from computational linear algebra [2] and machine learning [3] to numerical optimization [4]. The prime motive in all these areas is to devise new ways of building scalable computational systems capable of "doing more with less." In the context of numerical optimization, the curse of dimensionality is caused by the exponential growth in the size of the search space with respect to an increase in the number of input variables. In recent years, this has been loosely referred to as "large-scale optimization" or "large-scale global optimization." The term global is to emphasize the role of heuristics and metaheuristics, especially in the context of continuous optimization. It should be noted that the notion of "large-scale" changes over time and varies from problem to problem. In this article, a problem is considered large scale if it causes scalability issues for state-of-the-art algorithms.
More specifically, a single-objective optimization problem can be defined as follows (assuming minimization):

$$\min f(\mathbf{x}), \quad \mathbf{x} = (x_1, \ldots, x_n) \in A \tag{1}$$

$$\text{s.t.} \quad \mathbf{g}(\mathbf{x}) \leq \mathbf{0}, \quad \mathbf{h}(\mathbf{x}) = \mathbf{0} \tag{2}$$

where $A$ is the domain of the function $f$, $\mathbf{x}$ is an $n$-dimensional vector in $A$, and $\mathbf{g}(\mathbf{x}) = (g_1(\mathbf{x}), \ldots, g_p(\mathbf{x}))$ and $\mathbf{h}(\mathbf{x}) = (h_1(\mathbf{x}), \ldots, h_q(\mathbf{x}))$ are vectors of inequality and equality constraints, respectively. Large-scale optimization is concerned with the scalability of optimization algorithms as $n$ grows in size, and with its effect on the number of constraints and their dimensionality. Rapid technological advancements cause the emergence of ever-growing optimization problems in various areas. For example, in construction engineering, we are entering the so-called "era of the megatall buildings," with the construction of the first kilometer-tall building already underway [5]. This has resulted in large-scale optimization problems [6] in construction engineering. Similarly, the data explosion phenomenon has caused the emergence of large-scale optimization problems at the heart of many data analytics and learning problems [3], [7]. Advances in machine learning and the rise of deep artificial neural networks have also resulted in optimization problems with over a billion variables [8], [9]. These optimization problems not only grow in size but do so in an exponential manner, i.e., the number of decision variables they entail grows exponentially [10]. This rapid growth has stimulated scientific research in various areas to build scalable optimization algorithms. Fig. 1(a) clearly shows the rapid growth in the number of scientific articles published on large-scale optimization in the last decade. Fig. 1(b) also shows the contribution of different subject areas to the topic. The dominance of computer science and mathematics is an indication of the importance of the algorithmic aspects of designing efficient optimization methods.
Population-based metaheuristics have also gained popularity in recent years for solving large-scale global optimization problems [11]. Despite the common criticism of being computationally expensive, the ubiquity of parallel computing has rendered the issues of population size and generational cost secondary in light of the unique capacity of these methods for dealing with multimodal landscapes and deceptive functions, and their general search capability. It has recently been shown that evolutionary algorithms, a type of population-based metaheuristic, can rival the classic optimization methods that have dominated the field of deep learning [12]. Evolutionary algorithms have also shown great capacity in solving problems of millions or even billions of variables where their classic counterparts proved inefficient [13]–[17].
In dealing with very complex problems, we are bound to resort to two main strategies: 1) finding an exact solution to a simplified model of a given problem or 2) finding an approximate solution to a complex but more accurate model of a given problem. Therefore, using a less competent optimization (or search) algorithm either demands oversimplification of a given problem or results in a low-quality solution to a more accurate model of the problem. In the context of large-scale global optimization, developing better search algorithms has two implications: 1) we can start to tackle even larger (more complex) problems and 2) we can find better solutions to problems of the same size. In recent years, a wide range of methods has been considered for large-scale optimization. These often fall within one of the following themes.

1) Problem decomposition.
2) Hybridization and memetic algorithms.
3) Sampling and variation operators.
4) Approximation and surrogate modeling.
5) Initialization.
6) Parallelization.
In this two-part survey, we give a comprehensive review of large-scale global optimization, looking into each of the above themes, summarizing the main research findings, and discussing their advantages and disadvantages. The scope and the breadth of the topics we cover in this series make it distinct from other review works in the field [11], [18]. Areas such as large-scale constrained optimization and large-scale multiobjective optimization have become active in the last two years and are almost absent from other reviews. Those reviews also categorize large-scale algorithms into decomposition and nondecomposition methods. Based on the main approaches stated above, in this article, we give a more nuanced taxonomy of approaches to large-scale global optimization.
Another unique feature of this article is that it looks at the above themes from a variable interaction perspective. We believe that the efficiency of algorithms is largely dependent on their effectiveness in exploiting problem structure. Variable interaction is an important attribute of optimization problems with a direct effect on a problem's degree of nonlinearity, its overall structure, and its degree of modularity. In classic genetic algorithms (GAs) research, for example, tight linkage is central to scalability [19]. The design of the inversion operator is also motivated by the importance of tight linkage in minimizing the disturbance of interacting genes by the crossover operator. In memetic algorithms, variable interaction affects the efficiency of dimensionwise local search [20], [21]. In cooperative coevolution (CC) and other divide-and-conquer methods, variable interaction governs the utility of a decomposition [22]. In estimation of distribution algorithms (EDAs) and other related algorithms, rotational invariance is determined by how well the interaction information is captured within the covariance matrix [23]. These are just a few examples that emphasize the extensive impact of variable interaction. Part II of this survey also features a section on the pitfalls and challenges of large-scale optimization, open research questions, and other special topics, such as large-scale multiobjective optimization and large-scale dynamic optimization.
Outline: This survey series comes in two parts, with a supplementary document accompanying part I. The two parts jointly cover the six approaches listed above as well as five major problem areas, including overlapping functions, imbalanced contribution, multiobjective optimization, constraint handling, and benchmarks and applications. Fig. 2 depicts a high-level structure of the main topics covered across both parts. Part I covers problem decomposition (Section II) and hybridization, memetic algorithms, and local search (Section III), which are two of the most widely researched approaches in the field. Part II covers the remaining algorithmic approaches, several major problem areas, a section on the pitfalls and challenges of large-scale global optimization, and some suggestions for future research. The background material is covered in the supplementary document accompanying part I.

II. LINKAGE LEARNING AND EXPLOITING PROBLEM STRUCTURE
The guiding principle of the algorithms presented in this section is to exploit the hidden or clouded structure of a given problem. Exploiting problem structure is common in many branches of optimization [24]–[27]. Gray-box optimization is a relatively new concept referring to the study of incorporating problem structure into the optimization process [28] (see Fig. 3). This notion of finding and exploring the "hidden order" [29] is an old one in evolutionary computation and has been studied extensively in the context of linkage learning [30], [31]. Goldberg et al. [32] showed that linkage plays a crucial role in the performance of GAs. Even separable problems can be exponentially difficult for simple GAs if the variable dependence information is unknown [32], [33]. It has also been shown that GAs with no linkage learning mechanism require an exponentially growing population size in order to locate the global optimum [34]. In the continuous domain, the rotation of the fitness landscape, which has the effect of introducing linkage between decision variables, changes the time complexity of GAs from $O(n \log n)$ to $O(n^n)$ [35]. It is, therefore, clear that in high-dimensional spaces, for an algorithm to be computationally plausible, the incorporation of structural information is paramount.
In gray-box optimization, it is assumed that the problem structure is known a priori. This is particularly the case for a wide range of discrete and combinatorial problems. For example, it is not realistic to assume that nothing is known about a traveling salesman problem (TSP); treating it as strictly black box is therefore unwarranted [36], [37]. The same goes for other popular discrete problems such as gene sequencing [38], or pseudo-Boolean problems such as MAX-kSAT or CNF-SAT [39]. The constraints of a problem can also reveal some information about its structure. For example, the variable reduction strategy [40], [41] allows the explicit use of constraint functions to infer relationships among the variables and to formulate some variables as functions of a set of core variables. In many problems, however, particularly in the continuous domain, the structure may not be evident. Therefore, to exploit the problem structure, it first needs to be discovered. Many algorithms have been proposed for discovering problem structure in the form of capturing its variable interaction topology. The merits of these algorithms are not limited to converting black-box problems into gray boxes. There are studies showing that even when the problem formulation is fully known (white box) and the dimensionality is not necessarily high, variable interaction analysis methods can help reveal nontrivial information about the problem [42].
There are also several different forms in which the structural information can be used to enhance the optimization performance. Some methods such as CC [43] decompose the problem into a number of explicit lower dimensional subproblems, therefore requiring a variable interaction analysis method to form a plausible problem decomposition. Some other methods, such as EDAs [44], [45] and Bayesian optimization algorithms (BOAs) [46], do not decompose the problem per se. They instead implicitly capture and make use of the interaction information in a probabilistic model of the problem, which they construct during the optimization process. In the rest of this section, we review such methods in the context of large-scale optimization. Fig. 4 gives an overview of implicit and explicit methods of exploiting problem structure. Section II-A covers the implicit methods and Section II-B covers the explicit methods.

A. Implicit Methods
Implicit methods exploit problem structure by building and evolving a model of the problem, which can then be used to bias the search toward promising regions of the search space. These methods differ in their choice of model representation and the information from which the model is created.
Interaction Adaptation: These methods are extensions of simple GAs with added mechanisms to promote tight linkage in the problem representation. The simplest such mechanism is the inversion operator [47], which acts on a string of variables and changes their order to increase the likelihood of placing interacting variables next or close to one another. Although the reordering of variables (genes) has been shown to improve the performance of GAs [48], such mechanisms are extremely slow in generating tight linkages [49]. This clearly limits their applicability to large-scale optimization.
Other methods such as the messy GA (mGA) [32], fast mGA (fmGA) [50], gene expression mGA (gemGA) [51], linkage learning GA (LLGA) [52], and linkage evolving genetic operator (LEGO) [53] use more sophisticated representations than a simple bit string to encode linkage information into the representation and employ special operators to change the representation over time to promote tightly linked representations.

Fig. 4. Exploiting problem structure can be done implicitly (Section II-A) by adapting the problem representation or probabilistic modeling, or explicitly (Section II-B) by means of variable interaction analysis (Section II-B1) and problem decomposition (Section II-B2).
Probabilistic Models: These methods use a probability distribution to represent the objective function or its various characteristics. EDAs [44], [54], the compact GA (cGA) [55], BOAs [46], [56], and Bayesian optimization [57] are examples of such methods. The high-level working principle of these methods is to generate one or more sample solutions from a probability distribution. The generated solution(s) are then used to update the parameters of the probabilistic model in an iterative process. Many different versions of such probabilistic optimization algorithms exist in the literature, which differ in the choice of their probabilistic models and how they are updated. Below is a short summary of three major types.
1) Building a Model of Variable Interactions: EDAs [44], [54] and BOAs [46] are two important families of such algorithms that capture variable interactions in their probabilistic models. BOAs are different from Bayesian optimization, which builds a model of the objective function and is discussed in the next bullet point. EDAs with a multivariate normal distribution can represent the interactions by adapting the covariance matrix of the Gaussian distribution. Different variants of EDAs have been used for large-scale global optimization, which will be covered later in this section. BOA uses Bayesian networks to represent interactions.
To do so, the algorithm needs to learn the structure of the Bayesian network and its parameters (conditional probabilities). This method is very flexible and efficient in representing and solving complex interaction patterns, such as overlapping [58] or hierarchical [58] problems. However, the model selection process that involves learning the structure of the Bayesian network is an NP-hard problem, which makes BOA a poor choice for large-scale global optimization.

2) Building a Model of the Objective Function: In Bayesian optimization [57], not to be confused with BOA, the unknown objective function is treated as a random function modeled by placing a prior distribution over it. Whenever the actual objective function is evaluated, its input/output pair is used as new evidence to update the prior distribution to form the posterior distribution of the objective function. Finally, an acquisition function is constructed from the posterior distribution, which is used to determine the location of the next query point. The scalability of Bayesian optimization is the subject of several recent studies [59]–[67]; however, a detailed review of such techniques is out of the scope of this article. For a review of Bayesian optimization, the reader is referred to [57].

3) Building a Model of the Population Movement:
The covariance matrix adaptation evolution strategy (CMA-ES) [68] is a popular model building algorithm, which uses a multivariate Gaussian distribution to model the so-called successful steps taken by the algorithm during the course of optimization. The contour matching property of differential evolution (DE) [69] is similar to the notion of modeling population movement: the step sizes and their orientations are adapted to the landscape of the objective function. Details of advances in the DE literature on large-scale optimization are given in part II of the survey.

Estimation of Distribution Algorithms: A major problem of EDAs is their scalability in solving large-scale problems. Three major reasons contribute to this scalability issue: 1) accurate estimation of the distribution of high-quality solutions requires a relatively large sample size, which grows exponentially with the dimensionality of the problem [70]; 2) a small sample size will result in a poor estimation of the eigenvalues of the covariance matrix [71]; and 3) the cost of sampling from a multidimensional Gaussian distribution increases cubically with the problem size [72].
The strategies to alleviate the scalability issue of EDAs can be categorized into two major types: 1) space partitioning and dimensionality reduction methods where the aim is to control the complexity of the covariance matrix and 2) the use of heavy-tail distributions to improve exploration and diversity.
Space Partitioning and Dimensionality Reduction: EDAs with univariate models [73], [74] treat an n-dimensional problem as n 1-dimensional problems and as such are the simplest and have the most efficient sampling. However, several theoretical [75], [76] and empirical [77] studies have shown that univariate EDAs are inadequate for solving nonseparable problems. A full multivariate model, on the other hand, can be computationally expensive in high-dimensional spaces.
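To make the univariate case concrete, the following minimal Python sketch implements the generate-and-update loop of a univariate Gaussian EDA; the sphere objective and all parameter values are illustrative choices of ours rather than any specific published algorithm.

```python
import numpy as np

# Minimal univariate Gaussian EDA: each variable is modeled independently,
# so sampling and model updates are O(n) per individual (no covariance).
def sphere(x):
    return np.sum(x**2, axis=1)

rng = np.random.default_rng(0)
n, pop_size, n_elite, iters = 20, 100, 25, 200
mu, sigma = np.zeros(n), np.full(n, 5.0)    # model: one Gaussian per variable

for _ in range(iters):
    pop = rng.normal(mu, sigma, size=(pop_size, n))   # 1) sample solutions
    elite = pop[np.argsort(sphere(pop))[:n_elite]]    # 2) select the best
    mu = elite.mean(axis=0)                           # 3) update the model
    sigma = elite.std(axis=0) + 1e-12                 # avoid full collapse

print("best f:", sphere(mu[None, :])[0])
```

Capturing interactions requires replacing the independent per-variable Gaussians with a full covariance model, which is exactly where the estimation and sampling costs discussed above arise.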
Dong et al. [78] proposed an EDA with model complexity control, which applies a threshold to the global Pearson correlation matrix to find weakly correlated dimensions and models them with univariate distributions. The remaining strongly correlated variables are partitioned into a set of lower dimensional spaces, each of which is modeled using a multivariate distribution. To alleviate the deficiencies of the Pearson correlation with non-Gaussian samples, Xu et al. [79] proposed using mutual information instead to detect variable dependence. A major downside of using mutual information, however, is its high computational cost [22]. Space partitioning by means of random grouping and CC has been used in several other studies [80]–[83] to manage the complexity of EDAs.
Dimensionality reduction techniques and space projection are other alternatives for controlling the model complexity of multivariate EDAs. Kabán et al. [84] borrowed the idea of random projection to lower dimensions from the theory of random matrices to tackle the scalability issue of EDAs [84], [85]. They proposed an algorithm that randomly projects the original large-scale problem into an ensemble of lower dimensional problems. Random matrix theory suggests that with an appropriate choice of target dimensions, it is possible to preserve important features of the original space (such as Euclidean distances or dot products) in the reduced space within a reasonable tolerance [86]. It has also been shown that the distribution of the sample points becomes more Gaussian in the reduced space [87]. These features make it possible to capture the variable correlations of the high-dimensional space using lower dimensional subspaces, therefore making the parameter estimation of EDAs more viable using less computational resources. This eliminates the need for oversimplification of the model as is the case in univariate EDAs. It is clear that a random projection to a lower dimensional space is not unique, and sample points can be projected down into any subspace of the original space. For this reason, [84] uses an ensemble of projections and estimates the covariance of the sample points in each lower dimensional subspace. Finally, the ensemble of projections is used to construct a new population (sample) in the original space. It has been shown that a proper combination of the ensemble of projected points results in a natural smoothing effect that ensures the exploration capability of the algorithm. In a similar vein, [88] used PCA to transform the multivariate Gaussian model of an EDA into its principal lower dimensional latent subspace.
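The rough sketch below illustrates the core idea of estimating covariances in random low-dimensional subspaces; the projection count, target dimension, and the simple averaging used to combine the ensemble are our own simplifications and do not reproduce the exact combination rule of [84].

```python
import numpy as np

# Estimate several cheap d x d covariances in random subspaces instead of
# one full n x n covariance, then combine back-projected samples.
rng = np.random.default_rng(2)
n, d, n_proj, pop_size = 1000, 20, 10, 200
pop = rng.normal(size=(pop_size, n))          # selected high-quality samples

models = []
for _ in range(n_proj):
    R = rng.normal(size=(d, n)) / np.sqrt(d)  # random projection matrix
    low = pop @ R.T                           # project samples to d dimensions
    models.append((R, np.cov(low.T)))         # d x d covariance estimate

# Generate one new candidate in the original space by averaging the
# back-projected samples drawn from each low-dimensional model.
new = np.mean([R.T @ rng.multivariate_normal(np.zeros(d), C)
               for R, C in models], axis=0)
print(new.shape)  # (1000,)
```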
Heavy-Tail Distributions: Sampling based on heavy-tail distributions has been used in many population-based algorithms [89]–[91] with the aim of improving exploration and population diversity. A range of such distributions, including Lévy, Cauchy, and t-distributions, has been used to enhance the performance of EDAs [73], [92], [93]. Among these, the Cauchy distribution has been used most widely with EDAs. Although the literature is clear on the efficacy of Cauchy sampling on low-dimensional problems [94], [95], there is controversy over its utility on high-dimensional problems [96]. Hansen et al. [97] reported that Cauchy's long jumps are almost invariably ineffective, while other studies found them beneficial [73], [74]. Sanyang et al. [92] compared the performance of univariate and multivariate Cauchy distributions with the Gaussian distribution within a random projection framework on a range of large-scale problems with up to 1000 dimensions. They reported that although a multivariate Cauchy performs better than a univariate Cauchy, they both perform worse than a multivariate Gaussian distribution on large-scale problems. In a subsequent study, they provided a theoretical explanation for the poor performance of the Cauchy distribution on large-scale problems, which was shown to be consistent with the empirical results [96]. The study showed that unlike Gaussian norms, Cauchy norms lack a good concentration property, causing a disproportionate number of very large steps, which results in an inefficient search strategy as the dimensionality increases.
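The concentration argument is easy to see numerically; the short check below (our own illustration, not from [96]) compares the step lengths of 1000-dimensional Gaussian and Cauchy mutations.

```python
import numpy as np

# Norms of high-dimensional Gaussian steps concentrate tightly around
# sqrt(n), while Cauchy steps produce a disproportionate number of very
# large jumps (the 99th percentile dwarfs the median).
rng = np.random.default_rng(3)
n, trials = 1000, 10000
g = np.linalg.norm(rng.normal(size=(trials, n)), axis=1)
c = np.linalg.norm(rng.standard_cauchy(size=(trials, n)), axis=1)
print("Gaussian norms: median %.1f, 99th pct %.1f" %
      (np.median(g), np.percentile(g, 99)))
print("Cauchy   norms: median %.3g, 99th pct %.3g" %
      (np.median(c), np.percentile(c, 99)))
```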
Scalability of CMA-ES: CMA-ES [98] is a popular optimization algorithm, which approximates the contours of a given objective function by means of a Gaussian distribution. CMA-ES exhibits several invariance properties, including rotation invariance, which is central to the efficient optimization of nonseparable functions. This is achieved through the iterative updating of a covariance matrix used to control its underlying sampling Gaussian distribution. CMA-ES has two major limitations in dealing with high-dimensional problems [99]: 1) a high number of strategy parameters, i.e., the degrees of freedom in the covariance matrix, which scales quadratically with the dimension; in other words, the space complexity of CMA-ES is $O(n^2)$; and 2) a high computational complexity that comes from the operations needed to adapt the covariance matrix, i.e., sampling from a multivariate normal distribution, and updating the covariance matrix and its factorization and eigendecomposition; this results in a cubic complexity in the number of dimensions, $O(n^3)$.
There have been several attempts to improve the scalability of CMA-ES by reducing its time and space complexities. Igel et al. [100] replaced the eigendecomposition update rule of CMA-ES with the Cholesky decomposition, reducing its time complexity to $O(n^2)$. Poland and Zell [101] proposed the adaptation of the most significant mutation vector in the main vector adaptation evolution strategy (MVA-ES), which reduces the time complexity of the algorithm to linear. Knight and Lunacek [102] restrained the optimization of an n-dimensional problem to its m main components. This algorithm, L-CMA-ES, reduces the time complexity to $O(m^2 n)$ and the space complexity to $O(mn)$. For the case of m = 1, the algorithm reduces to MVA-ES, and for m = n, it reduces to CMA-ES. Sun et al. [103] proposed another linear time evolution strategy, R1-NES, which uses a low-rank approximation of the covariance matrix by only considering its predominant eigendirection. Ros and Hansen [99] proposed sep-CMA-ES, in which only the diagonal elements of the covariance matrix are adapted. This reduces the time complexity of the algorithm to $O(n)$. LM-CMA [104], [105] is another limited memory implementation of CMA-ES inspired by the classic L-BFGS [106]. LM-CMA combines the idea of a low-rank approximation of the covariance matrix with the Cholesky decomposition to reduce the time and space complexity of the algorithm to $O(mn)$. Inspired by LM-CMA, Li et al. [107] proposed the fast CMA-ES for large-scale optimization, which maintains a number of evolution paths and uses two of them to generate new solutions.
Beyer and Sendhoff [108] proposed the matrix adaptation ES to avoid the construction of the covariance matrix by replacing it with an overall transformation matrix involving only matrix-matrix and matrix-vector operations. This reduces the time complexity of the algorithm to $O(n^3/\log(n))$. Loshchilov et al. [109] further improved the time and space complexity to $O(n^2)$ by replacing the multiplicative updates with additive ones.
Li and Zhang [110] proposed reducing the time and space complexity of CMA-ES by restricting the covariance matrix to a specific simple form. More specifically, they suggested the form $C = \alpha S + L$, where $C$ is the covariance matrix, $S$ is a sparse matrix, $L$ is a symmetric low-rank matrix, and $\alpha > 0$. When $S = I$ and $L$ is a rank-one matrix, the algorithm becomes a rank-one ES (R1-ES). By controlling the rank of $L$, the model can be generalized to a rank-m ES (Rm-ES). He et al. [111] proposed completely replacing the Gaussian sampling with a Gaussian mixture model, which exhibits a richer variable interaction modeling capability.
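To see why such a restricted form is cheap to work with, the small numeric check below samples from $C = \alpha I + VV^{\top}$ (our stand-in for $\alpha S + L$ with $S = I$ and a rank-m matrix $L = VV^{\top}$) in O(mn) time, without ever forming or factorizing the full n × n matrix.

```python
import numpy as np

# Sampling y = sqrt(alpha)*z0 + V @ z1 with independent standard normal
# z0 (length n) and z1 (length m) has covariance alpha*I + V V^T.
rng = np.random.default_rng(4)
n, m, alpha = 200, 5, 0.5
V = rng.normal(size=(n, m))

def sample():
    z0 = rng.normal(size=n)                 # O(n)
    z1 = rng.normal(size=m)                 # O(m)
    return np.sqrt(alpha) * z0 + V @ z1     # O(mn) total

ys = np.array([sample() for _ in range(20000)])
emp = np.cov(ys[:, :3].T)                        # empirical cov of 3 coords
exact = alpha * np.eye(3) + V[:3] @ V[:3].T      # analytic cov, same coords
print(np.round(emp, 2), np.round(exact, 2), sep="\n")   # approximately equal
```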

B. Explicit Methods
Explicit methods capture problem structure in explicit forms, such as variable interaction matrices or trees, and use them either to decompose the problem into a set of lower dimensional subproblems [43] or to design special variation operators, such as crossover, that respect the problem structure [112], [113]. Unlike implicit methods, explicit methods require extra objective function evaluations to find the variable interaction structure.
One explicit method that has gained popularity in large-scale global optimization is CC [43]. The CC framework requires the problem to be decomposed into a set of lower dimensional subproblems, each of which is optimized separately. The framework maintains a separate population for each subproblem (a.k.a. component), and the populations are "coevolved" in a round-robin fashion. Since the candidate solutions to each component do not form a complete solution, representative solutions of the other components are required to form a complete solution for evaluation. These representative solutions form a complete solution known as the context vector [114], which is used to evaluate all partial solutions. The context vector is updated iteratively and acts as the context in which the cooperation occurs.
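The following minimal Python sketch shows the round-robin cooperation just described, assuming a fixed decomposition into equal-size components and a simple (1+1)-style mutation as the component optimizer; the objective, step size, and decomposition are illustrative choices of ours.

```python
import numpy as np

def f(x):                                   # stand-in objective
    return np.sum(x**2)

rng = np.random.default_rng(5)
n, k = 12, 3                                # dimensions, components
groups = np.split(np.arange(n), k)          # fixed decomposition
context = rng.uniform(-5, 5, n)             # context vector: one full solution

for cycle in range(200):
    for g in groups:                        # round-robin over components
        trial = context.copy()
        # Mutate only this component's variables; the context vector
        # supplies values for all other components during evaluation.
        trial[g] += rng.normal(0, 0.5, len(g))
        if f(trial) < f(context):
            context = trial                 # cooperation: update the context

print("f(context) =", f(context))
```

In a full CC algorithm, each component would maintain its own population, and the context vector would be assembled from the best individual of every component.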
The first design choice in using CC is problem decomposition. It is clear that this can be performed in many different ways. The first CC algorithm [43], the cooperative coevolutionary genetic algorithm (CCGA), decomposes an n-dimensional problem into n 1-dimensional problems. CCGA was used to solve problems with up to 30 dimensions. Liu et al. [115] made the first attempt to solve large-scale optimization problems using a CC framework. They used fast evolutionary programming as the component optimizer in a CC framework with the decomposition strategy of CCGA to solve problems with up to 1000 dimensions [115].
Van den Bergh and Engelbrecht [114] suggested that the full decomposition strategy of CCGA runs the risk of introducing pseudominima, i.e., "minima created as a side effect of the partitioning of the search space." This is consistent with the observation that CCGA does not perform well on problems with interacting decision variables, such as the Griewank and Rosenbrock test functions [43]. To alleviate this problem, van den Bergh and Engelbrecht [114] proposed the use of particle swarm optimization (PSO) with a decomposition into k s-dimensional components instead of the extreme n 1-dimensional decomposition used by CCGA. Another decomposition strategy, divide-in-half, was proposed by [116], where the problem is divided into two equally sized components, which are optimized iteratively using DE [117].
It is clear that the algorithms discussed so far are oblivious to variable interaction and may place interacting variables in different components. This has a detrimental effect on the optimization performance and makes the algorithm sensitive to the updating policy of the context vector. A function can be decomposed in many different ways without any knowledge of the underlying variable interaction structure. It is, therefore, important to form the components such that the interaction between them is kept to a minimum. In the next section, we review several algorithms that attempt to deal with the variable interaction problem.
1) Dealing With Variable Interaction: This section is devoted to the review of various techniques to address variable interaction. Most of the methods presented here are concerned with an accurate identification of variable interactions, which are subsequently used to decompose a problem into its constituent parts. The decomposition of a problem often has two components: 1) a mechanism to identify interactions between the decision variables (covered in this section) and 2) a mechanism to form a set of subproblems based on the interaction information (Section II-B2). It is often the case that the interaction detection mechanism necessitates a certain way of forming the groups. For example, some methods provide the interaction information for every pair of variables in the form of an interaction matrix. This matrix contains sufficient information about the number of components and their sizes to warrant an automatic decomposition. Some other methods rely on various heuristics that increase the likelihood of placing interacting variables close to one another. Such methods do not suggest an obvious decomposition of the problem, therefore requiring extra information, such as the number of components and their sizes, to be supplied by the user.
In this article, we identify seven interaction detection principles and three decomposition mechanisms. Fig. 5 shows this classification and the interplay between them. Table S-I in the supplementary material contains a list of specific algorithms belonging to each class and their corresponding decomposition strategies. In what follows, each detection principle and decomposition mechanism is studied in further detail.
a) Random grouping: Random grouping [118] is the most basic way of dealing with variable interaction in CC. The rationale behind it is to randomly rearrange the decision variables after each coevolutionary cycle to increase the probability of placing interacting variables in the same component. Despite being superior to an arbitrary static decomposition [119], random grouping has two major drawbacks. First, the user has to decide the number and the size of the components. Second, the probability of simultaneously placing several interacting variables in one component approaches zero when there are many such variables. More specifically, given N cycles, the probability of assigning v interacting variables to one of the m available components for at least k cycles is [120]

$$P(X \geq k) = \sum_{r=k}^{N} \binom{N}{r} \left(\frac{1}{m^{v-1}}\right)^{r} \left(1 - \frac{1}{m^{v-1}}\right)^{N-r}. \tag{4}$$

Equation (4) shows that the probability of placing v variables in one component for at least k cycles decreases geometrically as v increases. Fig. 6(a) shows the sharp decline in the probability of correctly grouping interacting variables as the number of such variables increases. The figure also shows that a 200-fold increase in the number of random reorderings can hardly accommodate two extra variables with the same probability. Fig. 6(b) shows how the probability of correctly grouping v = 3 to v = 6 interacting variables increases with the number of trials; however, this does not significantly increase the likelihood of a correct grouping when the number of variables is high.
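A few lines of Python suffice to evaluate this probability and reproduce the qualitative trend (the parameter values below are illustrative):

```python
from math import comb

# Probability of placing v interacting variables in the same of m components
# for at least k of N random-grouping cycles, following (4).
def p_group(v, m, N, k):
    p = 1.0 / m ** (v - 1)     # per-cycle probability of a correct grouping
    return sum(comb(N, r) * p**r * (1 - p)**(N - r) for r in range(k, N + 1))

# With m = 10 components and N = 50 cycles, grouping 3 interacting variables
# at least once is moderately likely; 5 or more becomes very unlikely.
for v in (2, 3, 5, 7):
    print(v, round(p_group(v, m=10, N=50, k=1), 6))
```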
b) Delta grouping: An alternative grouping approach, called delta grouping [121], was shown to outperform random grouping on most functions from a set of 20 large-scale benchmark problems [122]. The rationale behind delta grouping is that the improvement interval of nonseparable variables is relatively smaller than that of separable variables [123] (Fig. 7). Therefore, delta grouping sorts the variables based on the average dimensionwise displacement of the sample points between two consecutive iterations. Once the decision variables are sorted based on the magnitude of their average displacement, called "delta" ($\delta$), they are grouped into k components of size s, both of which are determined by the user. A major drawback of delta grouping is its low performance on functions with more than one nonseparable component. Ge et al. [124] improved the grouping quality of delta grouping by measuring the success and failure rates of the groups and evolving the grouping accordingly. Liu et al. [125] used the idea of delta grouping to solve nonseparable functions. The algorithm that they proposed uses line search along each dimension to estimate the improvement along that dimension. The dimensions with similar improvements are then grouped together to form a subproblem for optimization. Unlike delta grouping, which is used with CC, this method uses local search and a modified DE, which applies the mutation operator to a designated subset of decision variables belonging to the same group.

c) Fitness difference minimization: The methods reviewed in this section are based on the adaptive rearrangement of decision variables to minimize an error function, which is claimed to minimize the interaction between the resultant components [126]. The rationale is that for a partially separable function, the difference between the overall objective function and the sum of its nonseparable subfunctions should be zero, i.e., $f(\mathbf{x}) - \sum_{i=1}^{m} f_i(\mathbf{x}_i) \equiv 0$, where $f_i(\mathbf{x}_i)$ is the ith nonseparable subfunction. Motivated by this, Sayed et al. [126] decomposed a problem by finding an arrangement of decision variables that minimizes an approximation to the following least squares problem:

$$\min \left( f(\mathbf{x}) - \sum_{i=1}^{m} f_i(\mathbf{x}_i) \right)^2. \tag{5}$$

Due to the black-box assumption, the number of components and their dimensions are unknown. Therefore, Sayed et al. [126] assumed uniform k d-dimensional components and searched for a rearrangement of the decision variables that minimizes an approximation to (5), denoted (6), in which $\mathbf{c}_1$ and $\mathbf{c}_2$ are solutions whose elements are set to constants $c_1$ and $c_2$, respectively, and $\tilde{f}_i(\mathbf{x}; \mathbf{y})$ is a parameterized version of f with the variables belonging to the ith component set to $\mathbf{x}$ and the rest to $\mathbf{y}$. This variable grouping method, which is called dependency identification (DI), was shown to outperform random grouping on the CEC'2010 large-scale benchmark suite. Aguilar-Justo and Mezura-Montes [127] replaced the random rearrangement with two more systematic strategies for generating variations in the grouping, which showed better performance compared to the random case. A problem with all decomposition methods based on fitness difference minimization is that they require the user to specify the number or the size of the components. To fix this problem in the context of constrained problems, Aguilar-Justo et al. [128] proposed evolving the best arrangement of the decision variables as well as the number of components using a GA.
Sayed et al. [129] also proposed a variation of (6) in which the sum of absolute differences is used instead of the sum of squares. Despite DI's improved performance compared to random grouping, the optimization problem defined in (6) is NP-hard due to the large number of possible k d-dimensional decompositions. To alleviate this problem, Sayed et al. [126] randomly rearranged only 10% of the decision variables with greedy search. Dai et al. [130] conducted a study on the effect of this parameter on the performance of DI. The experimental results showed that DI is sensitive to this parameter, with a tendency toward better performance when the rate is larger than 60%. The results also confirmed the superiority of DI over random grouping for a wide range of rates; however, this was not the case when compared to delta grouping.
d) Statistical methods: The methods reviewed in this section rely on statistical features of the evolving population to infer variable interaction. Tiwari et al. [131] proposed the idea of using regression analysis to deal with what they call inseparable function interaction and variable dependence. Inseparable function interaction is identical to what we call variable interaction or epistasis in this article. Variable dependence, however, pertains to the existence of a functional dependence between decision variables, i.e., the possibility of expressing one variable as a function of some other set of variables. Although the two notions appear to be linked, the authors did not investigate their connection from a formal mathematical point of view. Following the idea of variable dependence, Tiwari and Roy [132] proposed the GA for variable dependence (GAVD), which uses regression analysis to find the sets of dependent and independent decision variables by iteratively setting the regression coefficients of individual decision variables to zero and examining the goodness of fit. Once the independent and dependent decision variables are found, a GA is used to optimize the independent set, and regression analysis is used to estimate proper values for the dependent set. Roy and Tiwari [133] proposed the generalized regression GA (GRGA) to deal with inseparable function interaction. They employed regression analysis on the decision variables in the course of optimization and studied the changes in the coefficients of the regression model over time to guide the optimization process. A major drawback of both GAVD and GRGA is poor scalability.
One statistical way of dealing with variable interactions is to calculate the Pearson correlation matrix of the population and apply a threshold to the coefficients to form the components. Here, the assumption is that weak correlations indicate weak interactions. These techniques often calculate the correlations based on either the entire population or its top p% samples. Correlation-based adaptive variable partitioning (AVP) [134] is one such method, which groups pairs of variables whose correlations are larger than a predefined threshold and optimizes them against the remaining set in a coevolutionary manner. Singh and Ray [135] proposed an improved version, AVP2, which can form multiple groups of various sizes depending on the underlying variable interaction structure. For each variable i ∈ {1, . . ., n}, AVP2 forms a group with all other decision variables whose correlation coefficient with the ith variable is above a certain threshold. This results in a total of n potentially overlapping groups, which are subsequently merged based on their common variables to form a set of disjoint groups. Rojas and Landa [136] proposed another correlation-based decomposition, 4CDE, which calculates the correlation coefficient between each variable and the objective function and divides the resulting correlation coefficients into equally sized intervals. The variables whose correlation coefficients fall within the same interval form a component, which is subsequently optimized in a CC framework with DE as its component optimizer. This process is repeated, and the correlation coefficients are updated using exponential smoothing. Some algorithms use the dimensionwise variance magnitudes of the population to form the groups. In variance priority CC, Wang et al. [137] formed the groups based on the variance magnitudes of the current candidate solutions along various dimensions and selected the top k variables having the largest variances to form a component. Liu and Tang [83] proposed a cooperative coevolutionary framework based on CMA-ES, which adaptively selects from a pool of three decomposition strategies at each coevolutionary cycle. The three decomposition methods are: 1) random grouping; 2) min-variance decomposition (MiVD); and 3) max-variance decomposition (MaVD). The rationale behind using various decompositions is to regulate the exploration/exploitation balance of the algorithm. MiVD sorts the variables based on the magnitude of the main diagonal elements of CMA-ES's covariance matrix and divides them into k s-dimensional groups. This strategy minimizes the intragroup variances. Conversely, MaVD maximizes the intragroup variance and forms the groups by taking a variable from every group formed by MiVD, which is also used in the variance priority CC proposed by [137]. To coordinate between the various decomposition methods, the algorithm assigns a probability value to each decomposition method at the end of each cycle, which is used to randomly select a decomposition method for the next cycle.
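The sketch below illustrates the correlation-threshold idea in the spirit of AVP2: estimate absolute Pearson correlations from the population, link variable pairs above a threshold, and merge overlapping groups into disjoint ones. The threshold, data, and merging details are illustrative choices of ours, not the exact AVP2 procedure.

```python
import numpy as np
from collections import deque

def correlation_groups(pop, tau=0.6):
    n = pop.shape[1]
    corr = np.abs(np.corrcoef(pop.T))        # |Pearson| between variable pairs
    groups, unvisited = [], set(range(n))
    while unvisited:
        start = unvisited.pop()
        comp, queue = {start}, deque([start])
        while queue:                         # merge overlapping groups via BFS
            v = queue.popleft()
            for j in list(unvisited):
                if corr[v, j] > tau:
                    unvisited.discard(j)
                    comp.add(j)
                    queue.append(j)
        groups.append(sorted(comp))
    return groups

rng = np.random.default_rng(6)
pop = rng.normal(size=(500, 8))              # population of candidate solutions
pop[:, 1] = pop[:, 0] + 0.1 * rng.normal(size=500)   # correlate x0 and x1
print(correlation_groups(pop))               # [[0, 1], [2], [3], ..., [7]]
```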
None of the statistical methods presented so far can detect interacting variables with reasonable precision, and they rely on the user to specify the number and/or the size of the components. For example, although the use of variance magnitudes has its own merits and can potentially improve the optimization performance, its effectiveness in capturing variable interaction has not been established. The Pearson correlation can only capture linear relationships among the variables, which is also inaccurate for measuring variable interaction. To address these issues, Sun et al. [22] proposed maximum entropic epistasis (MEE), which uses the maximal information coefficient (MIC) [138] to check for a functional relationship between a variable $x_i$ and the partial derivative of the objective function with respect to another variable $x_j$, i.e., $\partial f/\partial x_j$. Since MIC is based on mutual information, it can capture nonlinear relationships. To deal with the black-box nature of the problem, MEE approximates $\partial f/\partial x_j$ using finite differences: $\partial f/\partial x_j \approx [f(x_j + \delta x_j) - f(x_j)]/\delta x_j$. MEE uses this method to check for direct interactions between all pairs of variables by applying a threshold to the MIC values to form a binary variable interaction matrix. Finally, the interaction matrix is treated as the adjacency matrix of an undirected graph, and the groups are formed by finding independent components using the breadth-first search algorithm. MEE has shown a very high variable interaction detection accuracy on an array of 24 high-dimensional functions. Two major drawbacks of MEE are its high computational cost, caused by the pairwise analysis of the decision variables, and its sensitivity to the choice of the threshold value.
e) Metamodeling: The methods discussed in this section infer variable interaction information in the process of building a surrogate or a metamodel of the objective function. Although metamodeling is often used for expensive function optimization, there have been few attempts to use it for finding the problem structure. One such algorithm was proposed by [139], which uses the high-dimensional model representation (HDMR) [140] technique to find interacting variables. HDMR has the following general form that can be used to approximate a function:

$$f(\mathbf{x}) = f_0 + \sum_{i=1}^{n} f_i(x_i) + \sum_{1 \leq i < j \leq n} f_{ij}(x_i, x_j) + \cdots + f_{1 \ldots n}(x_1, \ldots, x_n) \tag{7}$$

where $f_0$ is the zeroth-order term, $f_i(x_i)$ are the first-order terms capturing the effect of each variable acting independently, $f_{ij}$ are the second-order terms capturing the correlated contribution of $x_i$ and $x_j$, and finally, $f_{1 \ldots n}$ is the nth-order term capturing the joint correlation of all decision variables not covered by all other terms. HDMR has a finite number of terms and is exact once all terms are included. A nice property of HDMR is that if the contribution of the pth-order terms is negligible, the contribution of all higher order terms will be even smaller. Mahdavi et al. [139] used a particular type of HDMR called RBF-HDMR [141] to approximate the objective function up to the second-order terms in order to capture variable interactions between pairs of variables. This technique is capable of finding the number of components and their sizes with good accuracy and has shown good performance on the CEC'2010 large-scale benchmark suite. In another study, [142] used cut-HDMR [143] to find separable and nonseparable variables, which are used in turn to approximate a given high-dimensional function using support vector regression HDMR.
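As a toy illustration of how a cut-HDMR-style expansion exposes pairwise interactions, the sketch below expands a function around a cut point, subtracts the zeroth- and first-order terms, and checks whether the second-order residual $f_{ij}$ vanishes; the function, cut point, and grid are our own illustrative choices.

```python
import numpy as np

# Cut-HDMR around a cut point c: f0 = f(c), f_i(x_i) = f(x_i, c_others) - f0,
# and the second-order residual f_ij is what remains after removing f0, f_i, f_j.
f = lambda x: x[0]**2 + x[1]**2 + 3 * x[2] * x[3]   # x2 and x3 interact
n, c = 4, np.zeros(4)
f0 = f(c)                                           # zeroth-order term

def f1(i, v):                                       # first-order term f_i(x_i)
    x = c.copy(); x[i] = v
    return f(x) - f0

def f2_max(i, j, grid=np.linspace(-1, 1, 5)):       # max |second-order term|
    worst = 0.0
    for a in grid:
        for b in grid:
            x = c.copy(); x[i], x[j] = a, b
            resid = f(x) - f0 - f1(i, a) - f1(j, b)
            worst = max(worst, abs(resid))
    return worst

print(f2_max(0, 1))   # ~0.0: no pairwise interaction between x0 and x1
print(f2_max(2, 3))   # 3.0: x2 and x3 interact
```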
f) Monotonicity detection: Munetomo and Goldberg [144] were the first to propose a variable interaction detection method that checks for the violation of monotonicity conditions through systematic perturbation of the objective function. The monotonicity conditions are defined as follows [144]:

$$\text{if } f(s^{(i)}) > f(s) \wedge f(s^{(j)}) > f(s), \text{ then } f(s^{(ij)}) > f(s^{(i)}) \wedge f(s^{(ij)}) > f(s^{(j)}) \tag{8}$$

$$\text{if } f(s^{(i)}) < f(s) \wedge f(s^{(j)}) < f(s), \text{ then } f(s^{(ij)}) < f(s^{(i)}) \wedge f(s^{(ij)}) < f(s^{(j)}) \tag{9}$$

where $s^{(\cdot)}$ denotes a candidate solution vector perturbed at the index specified in the bracket. It is clear that (8) checks for monotonic increase, and (9) checks for monotonic decrease. Munetomo and Goldberg [144] developed an algorithm based on (8) and (9) called linkage identification by nonmonotonicity detection (LIMD), which checks for violations of these conditions on a randomly initialized population. Any violation of the above two conditions for variables i and j declares them as interacting. LIMD was tested on low-dimensional binary problems.
In the context of large-scale optimization, Chen et al. [145] proposed CC with variable interaction learning (CCVIL), in which they used a principle similar to that of LIMD for variable interaction in large-scale continuous problems. They declare two dimensions i and j interacting if the following condition holds:

$$\exists\, s, s^{(i)}, s^{(j)}, s^{(ij)}: f(s) > f(s^{(i)}) \wedge f(s^{(j)}) < f(s^{(ij)}). \tag{10}$$

Similar to the notation used before, $s^{(\cdot)}$ denotes the solution s perturbed at the dimensions specified in the parenthesis. Although (10) appears to be different from the monotonicity checks defined by (8) and (9), the algorithmic implementation of CCVIL ensures that the following condition is also satisfied: $f(s) > f(s^{(j)})$. CCVIL uses monotonicity checking within a coevolutionary framework proposed by [146] to find the disjoint interaction groups of a given objective function. CCVIL starts by assuming full separability among all decision variables. It then iteratively processes all dimensions and checks their interaction with other dimensions. If an interaction is detected, the respective dimensions merge to form an interaction group. It should be noted that CCVIL uses a generalization of (10) in which multiple dimensions can be perturbed simultaneously. This allows the algorithm to check for interaction between a decision variable and an entire interaction group. CCVIL has shown good performance on the CEC'2010 LSGO benchmark suite [122].
It should be noted that satisfaction of the monotonicity conditions for a given set of sample points does not guarantee the separability of the decision variables. Indeed, two variables are separable only if the conditions hold for all possible choices of the four sample points needed in (8) and (9). Therefore, repeated application of the monotonicity conditions can increase the confidence that two variables detected as separable actually are. Based on this idea, Sun et al. [147] proposed statistical variable interdependence learning (SVIL), in which the monotonicity conditions are checked for all pairs of variables over k random sets of points. SVIL treats the proportion of detected interactions between the ith and the jth variables as their interaction probability, which is stored in a matrix D. To decompose a problem, SVIL applies a threshold to D to force it into a binary matrix. Since the grouping accuracy of SVIL is sensitive to the choice of the threshold parameter, the algorithm adapts this parameter in the course of optimization. Next, for each decision variable, a group is formed with all other decision variables that interact with it. This means that SVIL forms n overlapping groups, where n is the dimensionality of the problem. Finally, SVIL is used in a CC framework to optimize each component in a round-robin fashion. Although the empirical results on the high-dimensional CEC'2005 benchmark suite [148] suggest that SVIL is effective in improving the optimization performance, its variable interaction detection accuracy has not been studied independently using modular benchmark suites such as the CEC'2010 [122] and CEC'2013 [149] large-scale suites.
A major drawback of SVIL is its high computational cost due to its pairwise interaction detection mechanism, which makes it a quadratic algorithm, i.e., $O(n^2)$. To alleviate this issue, Ge et al. [150] proposed a generalized version of the monotonicity check in which simultaneous perturbations of multiple dimensions are allowed in order to check for interaction between two sets of decision variables $B_1$ and $B_2$. This is equivalent to replacing dimension i with $B_1$ and dimension j with $B_2$ in (8) and (9). This generalization allows us to check the interaction of a single variable with all other variables with a single application of the monotonicity conditions, i.e., $B_1 = \{x_i\}$ and $B_2 = \{x_1, \ldots, x_n\} \setminus \{x_i\}$. Based on this generalization, the authors proposed a recursive algorithm in which the interaction of a single variable ($x_i$) is checked against a set that is initialized to all other variables. If $x_i$ is found to be separable, the procedure returns; otherwise, the set is recursively divided into smaller sets until its cardinality reaches one. At this stage, all function calls return and merge the interacting variables into a group. This reduces the time complexity of the algorithm down to $O(n \log n)$. For separable functions, the decomposition can happen in time linear in the number of dimensions n.
g) Finite differences: The methods discussed in this section use finite differences to detect interacting variables. Although not explicitly defined as such, linkage identification by nonlinearity check (LINC) [151] is a variable interaction learning algorithm based on finite differences for finding the linkage sets of binary optimization problems. The same mechanism has also been used to identify interacting groups for real-valued problems (LINC-R) [152]. Both LINC and LINC-R were used with multipopulation GAs for solving low-dimensional problems [151], [153]. Omidvar et al. [154] proposed differential grouping (DG) by deriving a set of finite difference equations for interaction detection from the definition of partially additive functions (see Definition S-3 in the supplementary material) and applied it to large-scale problems. Omidvar et al. [154] showed that the LINC-R equations can be derived from the DG theorem. Despite their algebraic equivalence, DG is less susceptible to computational roundoff errors due to its simpler computations.
Theorem 1 [154]: Let $f(\mathbf{x})$ be an additively separable function. $\forall a, b_1 \neq b_2, \delta \in \mathbb{R}, \delta \neq 0$, variables $x_p$ and $x_q$ interact if the following condition holds:

$$\Delta_{\delta, x_p}[f](\mathbf{x})\big|_{x_p = a,\, x_q = b_1} \neq \Delta_{\delta, x_p}[f](\mathbf{x})\big|_{x_p = a,\, x_q = b_2} \tag{11}$$

where

$$\Delta_{\delta, x_p}[f](\mathbf{x}) = f(\ldots, x_p + \delta, \ldots) - f(\ldots, x_p, \ldots) \tag{12}$$

refers to the forward difference of f with respect to variable $x_p$ with interval $\delta$. Theorem 1 states that two variables $x_p$ and $x_q$ interact if evaluating (12) for a given value of $x_p$ yields different results for different choices of $x_q$ (i.e., $b_1$ and $b_2$). Theorem 1 is derived by showing that under the assumption of additive separability, the finite difference functions on the two sides of (11) give the same results, i.e., separability $\implies \Delta^{(1)} = \Delta^{(2)}$. The contraposition of this proposition can be used to detect interactions, i.e., $\Delta^{(1)} \neq \Delta^{(2)} \implies$ nonseparability. Since exact equality cannot be checked on computational systems, the left-hand side of the implication is changed to $\lambda = |\Delta^{(1)} - \Delta^{(2)}| > \epsilon$. It should be noted that although $\Delta^{(1)} \neq \Delta^{(2)}$ implies an interaction, their equality does not necessarily imply separability. Indeed, the theorem is silent on this case. However, by invoking weak syllogism, one can argue that observing $\Delta^{(1)} = \Delta^{(2)}$ makes separability more plausible. If repeated application of Theorem 1 with different random values results in $\Delta^{(1)} = \Delta^{(2)}$, the probability of the two variables being separable increases exponentially [152]. However, for most practical applications, a single equality observation is sufficiently accurate for finding separable variables. Nonetheless, failing to detect an interaction is more detrimental to the optimization performance than the conservative alternative of assuming separable variables to be interacting.
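A direct implementation of this check takes only a few lines; in the sketch below, the sample points, $\delta$, and $\epsilon$ are illustrative choices (a robust implementation such as DG2 chooses them far more carefully).

```python
import numpy as np

# Pairwise interaction check of Theorem 1: compare the forward difference
# with respect to x_p at two settings of x_q.
def interact(f, n, p, q, a=1.0, b1=-1.0, b2=2.0, delta=0.5, eps=1e-9):
    x = np.zeros(n)
    x[p], x[q] = a, b1
    d1 = f(x + delta * np.eye(n)[p]) - f(x)   # Delta at x_q = b1
    x[q] = b2
    d2 = f(x + delta * np.eye(n)[p]) - f(x)   # Delta at x_q = b2
    return abs(d1 - d2) > eps                 # lambda > eps => interaction

f_sep = lambda x: np.sum(x**2)                          # fully separable
f_non = lambda x: (x[0] * x[1])**2 + np.sum(x[2:]**2)   # x0 and x1 interact
print(interact(f_sep, 4, 0, 1))               # False
print(interact(f_non, 4, 0, 1))               # True
```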
Although Theorem 1 can be used to identify individual interactions between pairs of variables, the theorem itself does not dictate a particular grouping mechanism. Canonical DG works by choosing a candidate variable $x_i$ and scanning all other dimensions to find all interactions with $x_i$. If an interaction is found, the variable is removed from the set being scanned and is grouped with $x_i$ to form a component. The process continues until all variables interacting with $x_i$ are extracted and added to the component containing $x_i$. This procedure assumes full nonseparability within a component, i.e., all variables interact with all other variables. This approach fails on problems with overlapping components. For example, if the interaction pattern $x_i \leftrightarrow x_j \leftrightarrow x_k$ exists, then in the first pass of the algorithm, $x_j$ will be removed from the set and the interaction between $x_j$ and $x_k$ will never be checked. Rosenbrock is such a problem, on which DG exhibits poor accuracy [154].
Another grouping strategy is to form an n × n interaction matrix and form the nonseparable groups (or components) by analyzing the interaction matrix. Several algorithms are based on this idea: they treat the interaction matrix as the adjacency matrix of an undirected graph and use the connected components algorithm to form the groups. Global DG (GDG) [155], graph-based DG (gDG) [156], and DG2 [157] are such methods. These algorithms can potentially give an accurate picture of the problem structure at the expense of having a quadratic time complexity. Among these, DG2 is the most efficient and achieves the theoretical lower bound on the number of function evaluations required to form a complete interaction matrix based on the repeated application of Theorem 1. This lower bound suggests that to improve the efficiency, we either need to compromise the accuracy or change the underlying theorem. These two possibilities are addressed next.
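The grouping step itself is a standard graph traversal; the sketch below forms components from a hand-made binary interaction matrix via breadth-first search (the matrix is illustrative).

```python
from collections import deque

# Treat the binary interaction matrix as the adjacency matrix of an
# undirected graph and extract connected components via BFS.
def connected_components(adj):
    n = len(adj)
    unvisited, groups = set(range(n)), []
    while unvisited:
        start = unvisited.pop()
        comp, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for j in range(n):
                if adj[v][j] and j in unvisited:
                    unvisited.discard(j)
                    comp.add(j)
                    queue.append(j)
        groups.append(sorted(comp))
    return groups

# x0-x1-x2 form an overlapping chain; x3 is separable.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0]]
print(connected_components(adj))   # [[0, 1, 2], [3]]
```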
Indirect Interactions: Sun et al. [158] proposed the notion of indirect interactions (Definition 1) and used it in an algorithm called extended DG (XDG) to improve the grouping efficiency.
Definition 1: Let $f: \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. Decision variables $x_i$ and $x_j$ conditionally (indirectly) interact if for any candidate solution $\mathbf{x}^*$, $\partial^2 f(\mathbf{x}^*)/\partial x_i \partial x_j = 0$, and a set of decision variables $\{x_{k_1}, \ldots, x_{k_t}\} \subset X$ exists such that $x_i$ directly interacts with $x_{k_1}$, $x_{k_l}$ directly interacts with $x_{k_{l+1}}$ for $l \in \{1, \ldots, t-1\}$, and $x_{k_t}$ directly interacts with $x_j$.

XDG works by forming n components, one for each variable, containing all the variables it interacts with. This means that a variable may appear in multiple components. To reduce the number of function evaluations, XDG relies on Definition 1 and does not check $x_i$ and $x_j$ for interaction if both interact with a common variable $x_k$. Finally, XDG merges the groups whose intersection is not null. Although XDG saves some function evaluations due to the indirect interaction assumption, it is not as efficient as DG2, and its time complexity remains $O(n^2)$. Fast interdependency identification (FII) [159] is another algorithm that draws on the notion of indirect interaction to improve the grouping efficiency.
Generalized Theorem: Hu et al. [159] were the first to use simultaneous perturbation of multiple dimensions to check the interaction between any two sets of decision variables using only four function evaluations. This makes it possible to find all separable variables in O(n) time by iteratively checking each variable against all other variables at once. For nonseparable variables, the algorithm starts with a candidate variable and checks it against all other variables. If an interaction is detected, the variables are merged to form a group. Thereafter, the variables formed into the group are perturbed simultaneously to find whether any of its members interacts with the next decision variable. Since a group is always checked against a single variable, a considerable number of function evaluations can be saved.
Sun et al. [160] formalized the notion of simultaneous perturbations and proposed the following extended version of the DG theorem.

Theorem 2 [160]: Let f : ℝ^n → ℝ be an objective function, D = {1, …, n}, and X_1 ⊂ D and X_2 ⊂ D be two mutually exclusive subsets of decision variables: X_1 ∩ X_2 = ∅. If there exist two unit vectors u_1 ∈ U_{X_1} and u_2 ∈ U_{X_2} (where U_X denotes the set of unit vectors that are zero in all dimensions not indexed by X), two real numbers l_1, l_2 > 0, and a candidate solution x* in the decision space, such that

f(x* + l_1 u_1 + l_2 u_2) − f(x* + l_2 u_2) ≠ f(x* + l_1 u_1) − f(x*)

then there is at least one interaction between a variable in X_1 and another in X_2.
Theorem 2 implies that with only four function evaluations, the existence of an interaction between arbitrary sets X_1 and X_2 can be established. Sun et al. [160] used Theorem 2 to propose recursive DG (RDG) to form the interaction groups. This reduces the time complexity of interaction detection to O(n log n), which is below the theoretical lower bound that DG2 attains for building a complete interaction matrix [157]. As stated earlier, this reduction in time complexity comes at the expense of losing the accuracy of a full interaction matrix. As a result, algorithms such as RDG, FII, and XDG cannot detect overlapping components of a function. For instance, these algorithms cannot distinguish the interaction graphs depicted in Fig. 8. Yang et al. [161] further reduced the computational cost of RDG by maintaining and using historical information during the decomposition process to avoid some unnecessary evaluations. Kim and Choi [162] also improved upon RDG by pruning its recursive search tree in the divide-and-conquer process. They also used a variable influence metric to presort the decision variables with the aim of increasing the chance of pruning and, hence, reducing the total number of objective function evaluations. In another study, Xue et al. [163] reduced the depth of RDG's recursion tree by dividing the variables into three subsets [164] instead of two, thus reducing the number of objective function evaluations. They also proposed an alternative view of the simultaneous perturbations and augmented it with the topological information of the problem to further reduce the computational cost of problem decomposition.
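A minimal sketch of the set-to-set check of Theorem 2 and of RDG's divide-and-conquer use of it is given below. The flat perturbation size l and the fixed threshold eps are simplifications of ours; the careful choice of perturbation magnitudes and the DG2-style error bounds used by RDG2 are omitted.

```python
import numpy as np

def sets_interact(f, x_base, X1, X2, l=1.0, eps=1e-3):
    """Theorem 2 check: perturb all variables in X1 and in X2
    simultaneously; four evaluations decide whether any variable in X1
    interacts with any variable in X2."""
    a = x_base.copy()
    b = a.copy(); b[list(X1)] += l            # x* + l1*u1
    c = a.copy(); c[list(X2)] += l            # x* + l2*u2
    d = b.copy(); d[list(X2)] += l            # x* + l1*u1 + l2*u2
    return abs((f(d) - f(c)) - (f(b) - f(a))) > eps

def rdg_split(f, x_base, group, rest):
    """Recursive half-splitting of `rest` (a list of indices) to find
    the members that interact with `group` (the core of RDG)."""
    if not sets_interact(f, x_base, group, rest):
        return []
    if len(rest) == 1:
        return list(rest)
    mid = len(rest) // 2
    return (rdg_split(f, x_base, group, rest[:mid]) +
            rdg_split(f, x_base, group, rest[mid:]))
```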
Interaction Detection Accuracy of Finite Difference Methods: In addition to the challenge of computational efficiency, most finite difference methods presented in this section are sensitive to the threshold parameter (ε) used to distinguish between separable and nonseparable variables. Theoretically, a nonzero λ = |Δ^(1) − Δ^(2)| signifies an interaction. However, computational roundoff errors can sometimes cause nonzero λ values even for separable variables. Therefore, when observing a nonzero value, it is important to determine whether it is caused by a genuine interaction or by computational errors. This clearly affects the choice of ε, which has been investigated in several studies. gDG [156] normalizes the λ values, which makes it less sensitive, and uses a tighter threshold, denoted σ, to detect interactions. This resulted in 100% accuracy on almost all of the CEC'2010 benchmark suite. Cao et al. [165] extended gDG to decompose large-scale multiobjective problems. GDG [155] and EVIID [162] define ε as a function of the objective function, based on the rationale that the magnitude of the computational errors depends on the objective function value: ε = α · min{f(x_1), …, f(x_k)}, where x_1, …, x_k are k random samples of the search space, and α is a small user-defined parameter that is allegedly less sensitive to computational errors than ε itself. Although GDG found the ideal decomposition for 18 out of 20 functions from the CEC'2010 LSGO benchmark suite, its performance deteriorates on the imbalanced functions of the CEC'2013 LSGO benchmark suite [166]. In FII [159], two different threshold values are used, one for detecting separable variables (λ = |Δ^(1) − Δ^(2)| < ε_1) and another for nonseparable variables (λ = |Δ^(1) − Δ^(2)| > ε_2). Despite this suggestion, FII uses ε_1 = ε_2 = 10⁻² in its experiments.
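As an illustration, the GDG-style objective-scaled threshold ε = α · min{f(x_1), …, f(x_k)} can be computed as follows; the function name and the parameter defaults are ours.

```python
import numpy as np

def gdg_threshold(f, lb, ub, n, k=10, alpha=1e-10, rng=None):
    """GDG-style threshold: scale alpha by the smallest of k random
    objective values so that eps tracks the function's magnitude."""
    rng = rng or np.random.default_rng()
    samples = rng.uniform(lb, ub, size=(k, n))
    return alpha * min(f(x) for x in samples)
```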
Omidvar et al. [157] conducted a systematic error analysis of DG2 to place tight bounds on the roundoff errors. DG2 estimates the greatest lower bound e_inf and the least upper bound e_sup of the computational errors. For each pair of variables, if λ < e_inf, it is treated as a genuine zero and the pair is declared separable; otherwise, if λ > e_sup, it is treated as a genuine nonzero value and the pair is declared as interacting. For λ ∈ [e_inf, e_sup], ε is set to a weighted average of the two bounds based on the total number of genuine zero and nonzero detections. Unlike the previous finite difference methods, DG2 is parameter free and calculates a different threshold value for each pair of decision variables. On the CEC'2013 LSGO benchmark suite, DG2 outperformed CCVIL, DG, GDG, and XDG. Chen et al. [167] proposed the global information-based adaptive threshold (GIAT) as an improved method for setting the threshold value based on e_inf and e_sup. GIAT calculates these two quantities according to DG2, and it then calculates the quantity ζ = (λ − e_inf) f_s(λ − e_inf) / max{|Δ^(1)|, |Δ^(2)|} for all pairs of variables, similar to the way it is done by DG2. Finally, all ζ values are sorted, and the two adjacent values with the largest difference are taken as the basis for calculating the threshold. GIAT uses this approach only on partially separable functions and sets the threshold value to the minimum of the two retrieved ζ values.
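The resulting three-way decision rule of DG2 can be summarized in a few lines; this is a sketch of the classification step only, and the weighted-average resolution of the uncertain band is left out.

```python
def classify(lmbda, e_inf, e_sup):
    """DG2-style three-way decision on a lambda value (sketch)."""
    if lmbda < e_inf:
        return "separable"        # treated as a genuine zero
    if lmbda > e_sup:
        return "interacting"      # treated as a genuine nonzero
    return "uncertain"            # resolved via a weighted threshold
```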
We close this section by reviewing two recent variants of RDG. RDG2, the state-of-the-art decomposition algorithm, applies the error analysis mechanism of DG2 to the recursive mechanism of RDG to find the error bounds. The algorithm inherits the accuracy of DG2 and the efficiency of RDG, and it outperforms both methods in grouping accuracy with a time complexity of O(n log n) [168]. RDG3 [169], the winner of the CEC'2019 Competition on Large-Scale Global Optimization, builds upon RDG2 and includes a mechanism to deal with problems with overlapping components (both overlapping problems and the LSGO competitions are covered in Part II of this survey).

2) Grouping Principles: This section revisits the explicit decomposition methods reviewed in the previous section and
analyzes them based on their grouping mechanism rather than their interaction detection principles (Fig. 5). The grouping principles can be classified into three major groups: 1) automatic; 2) semiautomatic; and 3) k s-dimensional.

a) Automatic: The groups are formed automatically, in which case the number and the size of the components are determined by the algorithm. This is usually done by processing the interaction information identified by the detection mechanisms outlined in Section II-B1. For example, DG2 and GDG use the connected components algorithm on the interaction matrix of the function to form the groups. Other algorithms, such as gDG [156], use other graph partitioning techniques to form the groups. The automatic methods are predominantly based on DG and monotonicity detection, which are among the most accurate interaction detection mechanisms (see Table S-I in the supplementary document). The variations in the grouping principles of these techniques mainly affect problems with overlapping components. Peng et al. [170] suggested a solution exchange scheme between components based on multimodal optimization to cope with grouping inaccuracies. Ren et al. [171] also proposed the bihierarchical CC, which occasionally merges components to deal with decomposition inaccuracies. Overlapping problems are covered in Part II of the survey.
b) Semiautomatic: Semiautomatic methods require the size or the number of components to be specified by the user. Cluster-based methods, such as the algorithm proposed in [172], use the fuzzy c-means algorithm to form the groups, which requires the number of components as input. Some studies [173], [174] use spectral clustering with DG to take the degree of interaction into account. In statistical detection methods, such as AVP2 [135] and 4CDE [136], a threshold or a set of intervals must be defined on the correlation coefficients to form the groups.
A special type of semiautomatic grouping is called multilevel [175], [176], in which the user specifies a list of potential component sizes for the algorithm to choose from. The algorithm often uses a probability distribution to choose a component size from the list and uniformly divides the n-dimensional problem into smaller components. The algorithm often adapts the parameters of the probability distribution based on the performance of the selected item in the course of optimization. Some multilevel algorithms, however, use deterministic methods to gradually reduce the number of components during optimization [177].

c) k s-dimensional components: These algorithms are the least informed and require both the number and the size of each component to form the groups. Detection principles, such as random grouping [118], [120], [178], [179], delta grouping [121], fitness difference partitioning [126], [127], [129], [130], and statistical methods [83], [134], are among such methods.
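The uninformed k s-dimensional scheme, instantiated here as random grouping, takes only a few lines of code (s = n/k; the function name is ours).

```python
import numpy as np

def random_grouping(n, k, rng=None):
    """Divide n decision variables into k equally sized components at
    random (the classic k s-dimensional scheme with s = n // k)."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(n)
    return np.array_split(perm, k)

print(random_grouping(10, 2))  # e.g., [array([3, 7, 0, 9, 4]), array([1, 8, 2, 5, 6])]
```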

C. Advantages and Disadvantages of Explicit and Implicit Methods
Explicit and implicit methods each have their own merits. In dealing with partially additively separable functions, explicit methods, which use systematic perturbations, are generally more efficient in finding the interaction structure than implicit methods, which rely on statistical information gathered through random sampling. Although dimensionality is a major challenge to both types, the efficiency of implicit methods drops more sharply than that of explicit methods in higher dimensions. Accurate model estimation in implicit methods requires a sample size that grows exponentially with dimension [70], whereas the state-of-the-art decomposition algorithms require only O(n log n) function evaluations to infer interactions [160]. When the interaction structure is more complicated and a "crisp" decomposition is not possible, implicit methods are generally more flexible in representing and exploiting such structures. For instance, when the underlying components of the objective function have shared decision variables (overlap), it is not clear how the problem should be decomposed into a set of disjoint groups (see Section VI-A). Another advantage of implicit methods is that they build their model during optimization, whereas explicit methods work offline and infer interactions prior to optimization. However, as was stated before, this flexibility comes at the cost of a significant computational overhead to find a suitable model in the first place. For instance, BOAs use Bayesian networks, which can represent any arbitrary interaction structure; however, discovering the "right" model is an NP-hard problem [46]. Another advantage of implicit methods, such as CMA-ES, is their rotational invariance property, making them particularly suitable for solving nonseparable problems. Some studies have attempted to combine decomposition methods with implicit methods to help EDAs scale better and to bring the flexibility and the invariance property of the implicit methods to decompositions [80]–[83].

III. HYBRIDIZATION AND MEMETIC ALGORITHMS

According to the No Free Lunch theorem [180], no single search algorithm can uniformly outperform all other algorithms on all possible problems. This suggests that there are niche problem areas in which particular algorithms perform better than others. The aim of hybridization is to benefit from the unique search capabilities of several algorithms to find solutions of higher quality than those obtained by the individual algorithms in isolation. More generally, the aims of hybridization in evolutionary algorithms are to [181]: 1) improve their performance (such as convergence speed); 2) improve the final solution quality obtained by such algorithms; and 3) incorporate such algorithms as part of a larger system. Some hybridization algorithms are generic in the sense that they can hybridize any number or combination of existing search algorithms. Ensemble strategies are a major allied topic concerned with designing stable optimization algorithms by combining a set of "unstable and diverse" ones [182]. Hybrid local search and memetic algorithms play an important role in large-scale global optimization.

A. Hybrid Local Search Algorithms
As distinct from memetic algorithms, these algorithms are solely based on local search with no explicit global search component. They often rely on a systematic initialization, such as orthogonal arrays [183], to attain a good coverage of the search space. Multiple trajectory search (MTS) [183] uses three different local search methods, applied according to the properties of the search space in the vicinity of existing candidate solutions. Before performing an extensive local search, MTS tests all three local search mechanisms and picks the one that performs best in that neighborhood. MTS has been tested on the CEC'2008 LSGO benchmark functions with up to 1000 dimensions. Gardeux et al. [184] also combined two line search algorithms, i.e., enhanced unidirectional search (EUS) and the 3-2-3 line search algorithm, for large-scale global optimization. EUS searches along lines specified by a series of unit vectors not necessarily aligned with the coordinate system. Although variable interaction can have a significant effect on the choice of direction vectors, this aspect was not studied by the authors.

B. Memetic Algorithms
Memetic algorithms [185] represent a special hybridization paradigm in which local search is applied to individuals within an explorative evolutionary framework to mimic an individual learning procedure. This dual-phase mechanism has the potential to balance the exploration and exploitation forces, which are inspired by Darwinian evolution and by the effect of individual learning considered in Baldwinian/Lamarckian evolution, respectively. Memetic algorithms have gained popularity in large-scale global optimization, and some have ranked first in the IEEE CEC Competition on Large-Scale Global Optimization (for more information on competition results, refer to Part II of the survey). Memetic algorithms have also been applied to discrete optimization [186], Boolean satisfiability problems [187], and a range of application areas, such as large-scale hybrid flow shop problems [188], large-scale capacitated arc routing problems [189], [190], and big data optimization problems [191].
The major design considerations in memetic algorithms are [192]: 1) the frequency of applying local search; 2) the choice of solutions participating in local search (individual learning); 3) the search intensity, i.e., the duration of applying local search; and 4) the choice of local search algorithms. In addition to the above, algorithms using a repertoire of local search procedures need to devise a policy to choose from the available local search operators. In what follows, we review the relevant memetic and other hybrid algorithms used for large-scale global optimization in reference to the above design considerations.

1) Local Search Frequency:
The most common way of controlling the frequency of applying local search is to run it at regular intervals controlled by a user-defined parameter [193]–[195]. An extreme case of this approach is to run the local search procedure at every iteration [183], [196]–[198]. In some cases, if the local search process is marked as stagnant, it will not be invoked [196]. MTS, which uses multiple local search operators, runs all its operators at the beginning of every cycle and picks the best-performing operator for the rest of that cycle. In some algorithms, instead of running the local search at fixed regular intervals, the local search operators are applied such that the ratio between local and global search is fixed [199]–[201]. In algorithms such as MA-SW-Chains [199], [200], search frequency and intensity (covered later in this section) are linked through a fixed local/global search ratio; in other words, once the ratio is fixed, specifying the search frequency or the intensity determines the other. Some other approaches decide the invocation of local search probabilistically using a prespecified distribution whose parameters are set and/or adapted based on its success/failure rate [202], [203].
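The link between frequency and intensity under a fixed local/global search ratio can be made explicit with a small sketch; the parameter names are ours, and the bookkeeping in MA-SW-Chains differs in detail.

```python
def ls_budget(total_evals, ls_ratio=0.5, n_invocations=100):
    """Link local-search frequency and intensity through a fixed
    local/global ratio: once the ratio and the number of invocations
    are fixed, the per-invocation intensity follows."""
    ls_total = int(total_evals * ls_ratio)   # evaluations reserved for LS
    return ls_total // n_invocations         # evaluations per invocation

print(ls_budget(3_000_000))  # 15000 evaluations per local search episode
```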
2) Choice of Solutions: The choice of solutions to participate in the local search process can be random, performance based, complete (i.e., all solutions participate) [198], or any combination of the three. For example, the algorithm proposed in [203] chooses five random solutions, the best solution, and the top 25%. The performance-based methods apply local search to the best solution [194]–[197], [201], [202], the top n solutions [183], the top p% of solutions [193], or those improved by more than a predefined threshold in previous iterations [199], [200]. Some methods also emphasize the solutions not selected before [199], [200]. Zhao et al. [203] proposed to use the Clearing niching algorithm [204] to choose the solutions that undergo local search. They start from several solutions and progressively reduce to one, with the aim of controlling the exploration/exploitation balance.
3) Search Intensity: Search intensity is defined as the duration for which the local search is active. The simplest way of specifying the search intensity is a fixed-budget policy, i.e., running the local search operator for a fixed and predetermined number of function evaluations [193], [196], [201], [202]. MTS [183] uses a greedy approach and runs the local search algorithm until no improvement is observed. Other algorithms using MTS-based local optimizers use a similar strategy [194], [195]. As discussed previously in the context of search frequency, some algorithms link search frequency and intensity by forcing a fixed local/global search ratio [199], [200]. A more sophisticated way of specifying search intensity is to do so adaptively during the course of optimization. Liu and Li [197] determined the search intensity based on the success/failure rate of the local search procedure. Bolufé-Röhler et al. [198] simply doubled the search intensity every time the local search procedure was invoked. Zhao et al. [203] ran the global and local search procedures once at the beginning of each cycle and specified the search intensity for each in proportion to its success/failure rate.
4) Local Search Procedure: A wide range of global and local search algorithms has been hybridized for solving large-scale global optimization problems. Table I summarizes various hybridizations proposed in the literature. In the table, "G" denotes a global search mechanism and "L" denotes a local search procedure, the combination of which indicates a memetic algorithm, whereas "H" denotes a generic hybridization of a set of local or global search procedures. The table shows that DE [117] is the most widely used global search mechanism, followed by PSO [220]. A wide range of local search operators is also used, which can be categorized as random search, line search, and coordinate descent. Algorithms from all three categories are commonly used in various memetic or hybrid frameworks; however, the coordinate descent procedure of MTS [183], known as MTS-LS1, and Solis and Wets' [221] random search algorithm are the most popular operators.
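For reference, a simplified version of Solis and Wets' randomized local search is sketched below; the bias vector and the exact success/failure bookkeeping of the original [221] are deliberately simplified, so this should be read as an illustration rather than a faithful reimplementation.

```python
import numpy as np

def solis_wets(f, x, rho=1.0, max_evals=1000, rng=None):
    """Simplified Solis and Wets' random search: try a Gaussian step
    and its opposite; expand the step size after repeated successes,
    contract it after repeated failures."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    fx, evals, succ, fail = f(x), 1, 0, 0
    while evals + 2 <= max_evals:
        d = rng.normal(0.0, rho, size=x.shape)
        f_plus, f_minus = f(x + d), f(x - d)
        evals += 2
        if f_plus < fx:
            x, fx, succ, fail = x + d, f_plus, succ + 1, 0
        elif f_minus < fx:
            x, fx, succ, fail = x - d, f_minus, succ + 1, 0
        else:
            succ, fail = 0, fail + 1
        if succ >= 5:
            rho, succ = 2.0 * rho, 0     # expand after 5 successes
        elif fail >= 3:
            rho, fail = 0.5 * rho, 0     # contract after 3 failures
    return x, fx

print(solis_wets(lambda x: float(np.sum(x**2)), np.ones(10)))
```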
In most cases, a single local search procedure is used within a global exploratory process. However, in some cases, two or more operators, local or global, are used simultaneously, and their application needs to be coordinated by the hybrid framework. For instance, an algorithm proposed in [201] probabilistically selects between random walk and Nelder-Mead [222] based on an exponential distribution whose parameter is adapted according to a measure of population diversity. Another algorithm [191] adaptively selects between Rosenbrock's [205] and Powell's [206] search algorithms by testing their effectiveness using statistical hypothesis testing during the course of optimization. MTS coordinates three local search operators by running them on m solutions at the beginning of each cycle and uses the best-performing operator for the rest of the cycle until it becomes stagnant. There are also some algorithms [125], [213] that employ memetic algorithms in a CC framework. These algorithms decompose the problem into separable and nonseparable components and employ different local search operators suitable for each type of component. Further details of coevolutionary memetic algorithms are given later in this section.
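A generic probability-matching rule of the kind used to coordinate multiple operators might look as follows; this is a sketch, and none of the cited algorithms uses exactly this formula.

```python
import numpy as np

def pick_operator(rewards, p_min=0.1, rng=None):
    """Probability-matching selection among operators based on their
    accumulated rewards; p_min guarantees every operator keeps a
    minimum chance of being tried."""
    rng = rng or np.random.default_rng()
    r = np.asarray(rewards, dtype=float) + 1e-12
    p = p_min + (1.0 - len(r) * p_min) * r / r.sum()
    return rng.choice(len(r), p=p)

print(pick_operator([12.0, 3.5]))  # index of the selected operator
```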
The multiple offspring sampling (MOS) framework [223] is an abstraction layer on top of the reproductive operators of existing evolutionary algorithms, which systematizes the coordination of several search operators. MOS employs a repertoire of evolutionary operators and applies them based on their performance over the course of optimization in order to achieve a higher long-term performance. In the context of large-scale optimization, several evolutionary operators have been hybridized using the MOS framework [208]–[210], which are summarized in Table I. Experimental results on a wide range of benchmark functions with up to 1000 dimensions showed the scalability of the MOS framework to high-dimensional problems, making it the first-ranked algorithm in the CEC'2013 and CEC'2015 competitions on large-scale global optimization. The readers are referred to Part II of the survey for a discussion of the LSGO competitions and more recent results.
5) Parallel Versus Sequential Hybrids: Memetic algorithms and other hybrid methods can be implemented in a parallel or a sequential paradigm. In the context of large-scale global optimization, parallel hybrids are limited [16], [214], with most algorithms following a sequential paradigm. Wang et al. [216] suggest that parallel hybrids may not be effective in utilizing the benefits of various search algorithms due to disparities in their convergence speed and diversity maintenance. Molina et al. [224] proposed the idea of local search chains, which also emphasizes a sequential paradigm. The aim of these search chains is to perform an intensive local search during the course of optimization. The term chain alludes to the fact that a local search operator can be applied in succession, and each invocation can resume the search process from where it stopped in its previous invocation. Hence, forming a chain of local searches has the capacity to better exploit the properties of the landscape and focus on more promising regions. The idea of local search chains was first used with CMA-ES [68] as the local search operator to form the MA-CMA-Chains algorithm [224]. The computational cost of CMA-ES, however, makes it prohibitive for large-scale optimization. Therefore, the authors of [199] employed Solis and Wets' [221] algorithm as the local search operator and proposed MA-SW-Chains. This algorithm, ranked first in the CEC'2010 Competition on Large-Scale Optimization, showed better performance than other algorithms on the CEC'2010 large-scale benchmark problems [122]. Later, a variant of MA-SW-Chains, called MA-SSW-Chains, was developed in which the local search was applied only to a random subset of the decision variables [200].

6) Cooperative Coevolution and Memetic Algorithms: Decomposition methods (see Section II-B) and memetic algorithms are the two most widely used approaches to large-scale global optimization, with algorithms from both categories ranked first in large-scale global optimization competitions [169], [225], [226]. To benefit from the advantages of both approaches, some authors suggest the use of memetic algorithms [125], [126], [191], [207], [213] or other hybrids [218] as component optimizers in a CC framework. The general approach is to decompose the problem into a set of lower dimensional subproblems using the methods described in Section II-B and to optimize each component using a global search algorithm followed by an episode of local search. Cao et al. [207] proposed to use SaNSDE [227] followed by Solis and Wets' algorithm [221] on each component and to adjust their search intensity/frequency according to their performance. Sun et al. [213] also used SaNSDE as the global search operator, followed by dedicated local search procedures for the separable and nonseparable components. Sabar et al. [191] used two local search operators (i.e., Rosenbrock's [205] and Powell's [206]) in conjunction with DE as the global search algorithm. Liu et al. [125] proposed to use coordinate descent and quasi-Newton local search algorithms on the separable and nonseparable components, respectively, followed by a round of DE to further improve the population.
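The overall shape of a CC framework with a memetic component optimizer can be sketched as follows; the crude random-resampling "global" phase and Gaussian "local" phase are deliberately simple stand-ins for the SaNSDE and Solis-Wets/quasi-Newton operators used in the cited works.

```python
import numpy as np

def cc_memetic(f, groups, lb, ub, n, cycles=50, pop=20, rng=None):
    """Round-robin cooperative coevolution sketch: each component is
    improved in turn by a global step on the component's dimensions,
    followed by a short local search episode, while the rest of the
    context vector is held fixed."""
    rng = rng or np.random.default_rng()
    context = rng.uniform(lb, ub, n)          # best-so-far full solution
    f_best = f(context)
    for _ in range(cycles):
        for g in groups:
            for _ in range(pop):              # "global" phase
                cand = context.copy()
                cand[g] = rng.uniform(lb, ub, len(g))
                fc = f(cand)
                if fc < f_best:
                    context, f_best = cand, fc
            for _ in range(pop):              # "local" phase
                cand = context.copy()
                cand[g] += rng.normal(0, 0.1 * (ub - lb), len(g))
                cand = np.clip(cand, lb, ub)
                fc = f(cand)
                if fc < f_best:
                    context, f_best = cand, fc
    return context, f_best

# toy usage: two components of a partially separable function
f = lambda x: np.sum(x[:5]**2) + np.sum((x[5:] - 1)**4)
print(cc_memetic(f, [list(range(5)), list(range(5, 10))], -5, 5, 10))
```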
The MLSHADE-SPA algorithm [228], the runner-up of the IEEE CEC'2018 large-scale competition, takes a different approach to combining memetic and coevolutionary search. The algorithm divides the entire search process into several rounds, where each round comprises an initial coevolutionary search phase followed by a local search phase. In the coevolutionary phase, the problem is decomposed into three equally sized components, each of which is optimized using a different DE-based algorithm. The coevolutionary phase is followed by local search, where the best found solution is improved using a modified MTS-LS1 algorithm [183].

IV. CONCLUDING REMARKS
In this part of the series, we covered two major approaches to large-scale global optimization: 1) algorithms that exploit problem structure in the form of variable interaction and 2) hybrid algorithms, most notably memetic algorithms and local search.
Exploiting problem structure and gray-box optimization have been shown to be effective ways of solving large-scale problems (Section II). This structural information can be used in the form of explicit decomposition or implicitly through model building. The challenge of explicit methods is the cost of offline variable interaction learning, which requires objective function evaluations and adds an overhead to the overall optimization cost. Another issue is that a crisp decomposition is sometimes impractical due to various forms of coupling caused by the existence of multiple objectives, overlapping components (shared variables among subfunctions), or coupling through constraints. Implicit methods also struggle to capture the problem structure accurately, especially as the problem grows in size. Finding more efficient and effective ways of exploiting structural information, such as overlap, can have a significant impact on improving the scalability of optimization algorithms.
Hybrid methods and memetic algorithms, in particular, use the available computational budget in a more economical way and gain a competitive advantage by means of extensive local search. It is not clear how these algorithms may perform in finding the global optimum under more relaxed budget constraints. Their dimensionwise local search procedures are generally blind to variable interactions, making them better suited for separable functions. In memetic frameworks, the design considerations, such as the frequency of local search or the choice of the local search procedure, are ad hoc and require extensive experimentation. Designing generic frameworks capable of finding the optimal search intensity and choosing local search procedures automatically could significantly improve the performance of these algorithms and also make them readily available to practitioners in other fields.
In the next part of this survey series, we cover several other approaches to large-scale global optimization and also look at several important problem areas, such as multiobjective optimization and constraint handling.The next part also touches upon two major issues pertaining to the future of the field: 1) pitfalls and challenges that hinder the progress of the field and 2) the pressing open questions and potential areas of future research.

Fig. 1. Publication trend in large-scale global optimization from 1971 to 2021. The results show the number of documents containing variations of the phrases "large-scale optimization" or "large-scale global optimization" in their title, abstract, or keywords. The results include the hyphenated versions and variants referring to multi- and many-objective algorithms. (a) Documents per year. (b) Subject contributions. Source: Scopus.

Fig. 2. Outline of the topics covered in the two parts of this survey series on large-scale global optimization.

Fig. 5. Common variable interaction detection principles (olive) and variable grouping strategies (dark teal) used by explicit decomposition algorithms. The links indicate which grouping principles are common among which detection principles. The light teal color indicates variations on the grouping strategies.

Fig. 6. Random grouping's likelihood of correctly grouping interacting variables (m = 10). (a) Probability of detecting v variables given N cycles. (b) Probability of detecting variables as a function of the number of cycles.

TABLE I. Summary of common algorithms used in hybridization frameworks. The algorithms are classified into global search, random search, line search, coordinate descent, and other derivative-free methods, such as Nelder-Mead, CMA-ES, or tabu search. The character "G" denotes that the algorithm is used as a global search operator, "L" denotes a local search operator, and "H" denotes a generic hybridization of a set of local or global search operators.