A Review of Population-Based Metaheuristics for Large-Scale Black-Box Global Optimization—Part II

This article is the second part of a two-part survey series on large-scale global optimization. The first part covered two major algorithmic approaches to large-scale optimization, namely, decomposition methods and hybridization methods, such as memetic algorithms and local search. In this part, we focus on sampling and variation operators, approximation and surrogate modeling, initialization methods, and parallelization. We also cover a range of problem areas in relation to large-scale global optimization, such as multiobjective optimization, constraint handling, overlapping components, the component imbalance issue and benchmarks, and applications. The article also includes a discussion on pitfalls and challenges of the current research and identifies several potential areas of future research.

Abstract-This article is the second part of a two-part survey series on large-scale global optimization.The first part covered two major algorithmic approaches to large-scale optimization, namely, decomposition methods and hybridization methods, such as memetic algorithms and local search.In this part, we focus on sampling and variation operators, approximation and surrogate modeling, initialization methods, and parallelization.We also cover a range of problem areas in relation to large-scale global optimization, such as multiobjective optimization, constraint handling, overlapping components, the component imbalance issue and benchmarks, and applications.The article also includes a discussion on pitfalls and challenges of the current research and identifies several potential areas of future research.

Index
Terms-Black-box optimization, evolutionary optimization, large-scale global optimization, metaheuristics.

I. INTRODUCTION
T HE FIRST part of this two-part survey series covered decomposition methods and hybrid methods as two most widely investigated approaches in the literature.Fig. 1 depicts a high-level structure of the main topics covered across both parts.In this part, we review more approaches to largescale global optimization and also address several problem areas, including multiobjective optimization and constraint handling.Section II covers the sampling mechanism and variation operators of two well-known algorithms: 1) particle swarm algorithm [1] and 2) differential evolution [2], and how they are modified to solve large-scale problems.Section III covers the algorithms that rely on some form of approximation to cope with the challenges of high dimensionality.Section IV covers population initialization methods and their significance in the large-scale global optimization.Section V addresses the role of parallel algorithms to address the issue of scalability.
In addition to the algorithmic approaches to large-scale optimization, the article also addresses a range of problem areas pertaining to large-scale global optimization.These include: 1) scalability of multiobjective optimization algorithms with respect to their decision space; 2) challenges of constraint handling in the context of large-scale optimization; 3) challenges in dealing with problems with overlapping components and the issue of exploitable structure; 4) resource allocation and the problem of imbalanced contribution; and 5) benchmarking and application areas.The article also features a section on pitfalls and challenges of the field and potential areas of future research.

II. SAMPLING AND VARIATION OPERATORS
In part I of this survey, we have seen that many optimizers, such as estimation of distribution algorithms (EDAs), differential evolution (DE), and particle swarm optimization (PSO), can be used as component optimizers in decomposition-based frameworks (part I Section II-B), and as explorative agents in memetic algorithms (part I Section III).In this section, we focus on algorithm-specific aspects, such as parameter adaptation, modification of variation operators or design of new ones, diversity maintenance mechanisms, etc.In what follows, we cover DE and PSO in more detail and other metaheuristics are covered in Section S-II of the supplementary material.EDAs were covered in Part I Section II-A due to their focus on modeling variable interactions.Fig. 2 shows major aspects of PSO and DE, which have been studied under high-dimensional settings.

A. Differential Evolution
Due to its versatility, ease of implementation, and simplicity, DE [2] has become a widely used optimization algorithm for global optimization [3].Consequently, many variants of DE have been developed for large-scale global optimization [4] from which the most popular ones are briefly reviewed in this section.Most of the DE variants proposed for largescale optimization are centered around maintaining population diversity, which is done by various means, such as parameter adaptation, modification of DE mutation strategy, and diversity maintenance mechanisms.
1) Mutation Strategy: Mutation strategy is DE's central variation operator and has been subject to extensive investigation in [3].Several attempts have been made to improve DE for large-scale optimization by adapting or hybridizing several mutation strategies or by proposing new ones [5].
a) Adaptation of mutation strategy: Different mutation strategies exhibit various degrees of explorative/exploitative power each being suitable for certain problem types [5].In the context of LSGO, several attempts have been made to use several mutation strategies to improve the convergence properties of DE in high-dimensional spaces.These methods are either based on adaptively applying a set of strategies to a single population or using a multipopulation approach where each is evolved using its own mutation strategy.Ali et al. [6] proposed a multipopulation DE, where each subpopulation has its own mutation strategy.Banitalebi et al. [7] proposed a binary DE, which adaptively selects the mutation strategy for generating trial vectors and also adapts the scaling factor and crossover rate using a chaotic process (also see Section II-A2).Kushida et al. [8] proposed a rank-based mechanism for selecting the mutation strategies.Wang et al. [9] proposed to adaptively switch between DE/rand/1 and DE/current_to_best/1 mutation strategies.There are also approaches that switch between DE/rand/1/bin and a newly proposed strategy based on a uniform distribution [10], [11].
b) Vector selection: Canonical DE uses random individuals in the mutation process to generate a scaled difference vector to be applied to a base vector and generate a new solution.The choice of the vectors participating in the mutation procedure plays a crucial role in DE's convergence behavior [5].Ge et al. [12] analyzed different DE strategies and observed that those using the best individual are exploitative while those using random individuals are more explorative.They argue that instead of randomly selecting the participating vectors, it is better to systematically choose a vector close to the best solution to favor exploitation and far from the mutant to favor exploration.Inspired by PSO personal and global-best particles, Wang et al. [13] proposed to generate trial vectors by including the global-best and personal-best individuals in the mutation strategy.The authors claim that this process is akin to neighborhood search and improves convergence.García-Martínez et al. [14] associated four basic roles-1) placing; 2) leading; 3) correcting; and 4) receivingto each vector (solution), and the vector selection for mutation is performed based on these four basic roles.The vector selection strategy proposed by Ali et al. [6] is a function of the rank of a solution in the population, favoring higher quality solutions to participate in the mutation.Zhang and Sanderson [15] proposed a generalization of the classic DE/current-to-best mutation operator, DE/current-to-pbest, which uses the top p% best solutions to balance the greediness level of the algorithm and to maintain better diversity in the population.Some other studies also proposed several mutation strategies in which the solution quality is taken into account in the vector selection process [10], [11].Yang et al. [16] proposed to use multiple such difference vectors that are scaled differently to generate trial vectors.
The choice of the base vector to which the mutation is applied is also important in DE's convergence behavior.Ali et al. [6] proposed a new mutation strategy by selecting the base individual to be a convex combination of randomly chosen individuals from the population.Wang et al. [17] proposed an enhanced opposition-based DE in which the candidate solutions are translated into a so-called opposite space using the definition of opposite numbers [18].Wang et al. [17] argued that by evaluation of the candidate solutions and their translated counterparts in the opposite space, the probability of finding better solutions increases.This hypothesis is backed up by a set of empirical results on a set of 19 high-dimensional benchmark functions [19].Hiba et al. [20] proposed a center-based mutation strategy that uses the center of three randomly chosen solutions as the base vector.
2) Parameter Adaptation: Population size (NP), crossover rate (CR), and the scaling factor (F) are DE's major parameters affecting its convergence properties on different problem types.To eliminate the need for practitioners to set these hardto-tweak parameters, several attempts have been made to adaptively set these parameters in the course of optimization [3].In this section, we review some of these adaptation methods pertaining to large-scale global optimization.
The scaling factor and crossover rate are the two most studied parameters.Most of these attempts use some form of probability distribution from which the parameters are sampled.Brest et al. [21] dynamically changed F and CR using a uniform distribution.Wang et al. [22] used a similar adaptation mechanism except that they restricted the range of CR values.Brest et al. [23] introduced a sign changing mechanism to F in addition to sampling of CR and F values from a uniform distribution.To improve the current best, it uses smaller F values in the second half of the optimization process.Weber et al. [24] used a multipopulation approach each having its own scaling factor, which is regenerated in a probabilistic way.Zamuda et al. [25] proposed to adapt CR and F using a log-normal distribution.Improving upon selfadaptive DE (SaDE) [26], Yang et al. [27] used a Gaussian distribution to generate F and CR for each individual and update the mean of the Gaussian based on the parameter values succeeding in generating surviving offsprings.Zhang and Sanderson [15] proposed JADE, which randomly generates F and CR at every generation using Cauchy and Gaussian distributions, respectively, whose parameters are adapted in the course of optimization.Yang et al. [28] attempted to generalize the attempts by its predecessors, such as JADE, SaDE, and SaNSDE into a unified mechanism.
There are also alternative approaches that are not based on sampling from probability distributions.For example, some studies propose to change the crossover rate and the scaling factor using a chaotic process [7], [9].Takahama and Sakai [29] proposed a DE variant in which the scaling factor is adapted according to the modality feature of the search space.Kushida et al. [8] improved upon the works of Takahama and Sakai [29] by adding a rank-based mechanism for setting the scaling factor and crossover rate, as well as the mutation strategies.
The attempts for adaptation of population size in largescale optimization are ad hoc and limited.Brest et al. [23] proposed to gradually reduce the population size in the course of optimization.Wang et al. [30] adaptively changed the population size by adding or removing solutions based on their performance.Tanabe and Fukunaga [31] linearly decreased the population size.
3) Diversity Maintenance: Loss of population diversity is central to DE's deficiency in high-dimensional spaces.This is typically avoided in lower dimensions by means of increasing the population size [32].However, in high-dimensional spaces, the large population size hinders convergence [33].The adaptation of the mutation scaling factor, hybridizing an array of mutation strategies, or designing new ones are all attempts to improve the population diversity, which were addressed in the previous sections.Other approaches to diversification are multipopulation approaches, either in the form of several islands searching the original search space or by means of partitioning and coevolution, or maintaining an archive of solutions (Fig. 2).
Weber et al. [24] proposed a multipopulation strategy that simultaneously searches different parts of the search space.This algorithm randomly rearranges the individuals across the subpopulations with the aim of maintaining diversity among the solutions.Ge et al. [34] also proposed a multipopulation DE, which maintains diversity through migration of similar or diverse individuals.This mechanism controls the balance between exploration and exploitation.Ge et al. [35] used a multipopulation approach with automatic merge and split operations to improve the population diversity.Ali et al. [6] used a multipopulation DE with each population having its own mutation strategy to maintain population diversity.The information exchange between populations helps with balancing exploration and exploitation.Parsopoulos [36] used cooperative coevolution to partition the search space into smaller regions and optimized them with a micro DE.Micro DE is prone to losing diversity and getting trapped in the local optima; however, CC helps DE to focus the search with its micro population on smaller regions.Ge et al. [12] also used CC with cross-cluster mutation to promote exploration.
Maintaining an archive of solutions in the course of optimization is another means of maintaining diversity.Takahama and Sakai [29] proposed a DE variant with an archive of old solutions to help with diversification in the mating process, especially when the population size is small.Yang et al. [16] also proposed a DE variant, which keeps an archive of failed trial vectors with the hope of preserving the good genetic material.Zhang and Sanderson [15] proposed JADE, which maintains an external archive of inferior solutions to estimate the possible improvement directions.The external archive of JADE proved to be beneficial, especially on relatively high-dimensional problems with up to 100 dimensions.

B. Particle Swarm Optimization
PSO [1], [37] is known to be susceptible to premature convergence, which is magnified on high-dimensional problems [38], [39].Most approaches to handle large-scale optimization problems are centered around increasing diversity to improve exploration.In some cases, however, extreme exploration and exploitation can co-exist [40].The common remedies to PSO's premature convergence in the large-scale global optimization literature are population reinitialization, complementary sampling mechanisms, population size adaptation, space partitioning (by means of cooperative coevolution or otherwise), improving PSO's update rule and particle learning mechanisms, and mechanism to deal with variable interaction and nonseparable problems.
1) PSO Update Rule and Particle Learning: Excessive reliance on the global-best particle can result in premature convergence.Many attempts to avoid premature convergence revolve around reducing the influence of global best.Cheng and Jin [41] proposed a PSO variant, named CSO, which does not use personal or global-best solutions to update the position of the particles.Instead, random pairs are chosen to compete and the winner returns directly to the swarm, while the loser is updated by learning from the winner.CSO maintains a better diversity than PSO and is more explorative, making it better suited for large-scale global optimization.Tian et al. [42] proposed a variant of CSO based on a two-stage update rule for the position of particles and applied it to solving multiobjective problems.Naderi et al. [43] proposed a fuzzy adaptive system to adjust the inertia weight.This eliminates the use of global best in the velocity update rule to avoid premature convergence.Tang et al. [44] used a new update rule to exploit four best positions via Gaussian sampling to reduce the influence of global best and promote exploration.Pluhacek et al. [45] changed the velocity update rule such that with some probability, the velocity is either zero or is updated by taking either a random particle, personal best, or the global best into account.
Controlling information exchange among particles by means of population topologies or multipopulation structures are other ways of enhancing the swarm diversity [46].Fan et al. [47] proposed a PSO variant, which builds a dynamic neighborhood topology for PSO by performing clustering on the population.The neighbors of the particles are chosen from the same cluster.It also chooses a distant neighbor for particles through a random selection.Zhang et al. [48] improved CSO by applying Cauchy and Gaussian updates on the winner particles and used a ring topology to enhance the swarm diversity.Distributed multipopulation schemes [49]- [51] also promote controlled information exchange among particles, which can improve the population diversity.
In addition to the above, several other modifications to PSO's update rule have been suggested in the context of largescale global optimization.Arasomwan and Adewumi [52] found that inertia weight, acceleration coefficients, and random factors were not of significance in velocity update for obtaining global solutions.They proposed to adaptively update particles' velocity based on the Euclidean distance between particles and the global best.It also introduces the chaotic behavior into the particle position update rule.The notion of social learning has been proposed to reduce the adverse effect of isolated asocial learning [53], [54].Yang et al. [55] proposed to group particles into several levels based on their fitness.Two predominant particles from two different higher levels are chosen to guide the learning of particles.This has shown to improve diversity.The convergence speed controller was proposed as an independent operator to respond to premature or slow convergence [56].Cheng et al. [57] proposed a mutation operator based on the alpha-stable distribution to enhance the swarm diversity and avoid premature convergence.Li et al. [58] changed the particles' velocity update rule to decouple exploration and exploitation.Xue et al. [59] used multiple velocity and position update rules that are chosen probabilistically, whose parameters are adapted according to the effectiveness of each strategy.
2) Reinitialization, Sampling, and Population Size Control: Hsieh et al. [60] proposed a PSO variant with dynamic swarm size, which increases or decreases the swarm size based on the status of particles.In general, if the global best of the swarm is not updated for several consecutive iterations, new particles are generated by applying a crossover-like operator on the best solutions that were obtained in the past.Conversely, if the information content of the swarm is rich enough to allow frequent and robust updating of the global best, some of the poor-quality solutions might be removed from the swarm.There are some other mechanisms in place to avoid the growth of the swarm size beyond bounds.De Oca et al. [61] proposed an improved version of incremental particle swarmguided local search [62], which incrementally increases the population size to solve large-scale continuous optimization problems.
Garcí-Nieto and Alba [63] proposed restart PSO with velocity modulation.Velocity modulation is the process by which the particles are guided within a region of interest.Additionally, a restart mechanism is devised to avoid premature convergence of the algorithm.Cheng et al. [64] proposed partial reinitialization to improve exploration.They partition the search space and count the number of particles in each partition and abandon the low-activity areas.It also subdivides and reinitializes particles in higher activity regions.This mechanism has shown to improve exploitation.Zhou et al. [65] introduced opposition-based sampling into CSO.
3) Nonseparability and Coordinate Rotation: The update equations of PSO are dimensionwise, which makes it suitable for separable functions.Hendtlass [66] used dynamic momentum values to enable PSO to better handle nonseparability, making it suitable for functions with interacting dimensions.Korenaga et al. [67] introduced the coordinate rotation into the velocity update rule, which makes it possible to consider the information of other coordinates when calculating the velocity of a component.This can improve population diversity and has shown to be beneficial for large-scale optimization.Chu et al. [68] used PCA to parts of the space not spanned by the current population resulting in improved exploration.In a similar way, Chu et al. [69] also used PCA to find lost dimensions and promote search in the less explored areas due to lost dimensions.
4) Space Partitioning: Zhang et al. [70] partitioned the space by grouping the dimensions into segments.Then, some newly designed operators are assigned to each segment to update those segments.These operators are designed to improve population diversity and avoid premature convergence.Zhao et al. [71] proposed to form subswarms which work independently during the search.Then, the subswarms are randomly changed to enlarge their neighborhood and promote information exchange and population diversity.This improves exploration but may deteriorate exploitation, which is compensated for by means of local search.Cheng et al. [64] proposed space partitioning with the aim of identifying and abandoning low-activity areas and focused the search effort in high-activity areas.
Balancing exploration and exploitation and maintaining population diversity are generally central to the design of effective optimizers for large-scale optimization.In this section, we have seen that this plays a crucial role in both DE and PSO.Novel means of using population topologies [46], sampling methods, parameter adaptation, and space partitioning are needed to further improve these algorithms for large-scale optimization.

III. APPROXIMATION AND SURROGATE MODELING
Solving an approximation of a problem can potentially be more viable than obtaining a solution for its original highfidelity model.In other words, the aim of approximation is to simplify.In optimization, this simplification is either achieved by means of applying some transformation to the objective function (often to reduce the dimensions), or by building a model of the objective function or its constraints [72], i.e., a metamodel to act as a surrogate to the original complex problem (Fig. 3).Metamodels or surrogates are used to reduce the computational overhead of optimizing expensive objective functions.In recent years, one approach to large-scale global optimization is to treat it as an expensive optimization problem and use metamodels to solve it [73].
Metamodels are built and refined based on sampling of the objective function.In high-dimensional spaces, the accuracy of the model drops significantly due to the limited sample size upon which the model is built.To alleviate this problem, several studies use some form of problem decomposition to break the problem into a set of lower dimensional subproblems, each of which is approximated using a metamodeling technique, such as radial basis functions or the Gaussian processes.Due to problem decomposition, it is clear that the problem structure and variable interaction plays an important role in building an ensemble of surrogates.In some cases, the metamodeling itself is used to identify separable and nonseparable components of a problem [74].For example, [75] used cut-HDMR to detect the components [75].Then, a multisurrogate strategy is used to model the nonseparable components.In other cases, variable interaction analysis algorithms, such as differential grouping [76] is used to decompose the problem and the subsequent subproblems are then approximated using metamodeling techniques [77].Werth et al. [78] also proposed a sliding window approach over the decision vectors to reduce the dimensionality of the problem.They use LINC-R to discover the variable interaction structure of a subset of the decision variables falling within the sliding window.Then, a surrogate-assisted algorithm is used to optimize over those variables.
Metamodeling and problem decomposition are mutually benefiting approaches to solve large-scale optimization problems.Problem decomposition makes it possible to build more accurate metamodels, given the limited samples, while metamodels can help with the issue of evaluating partial solutions in a divide-and-conquer paradigm.In cooperative coevolution and other divide-and-conquer paradigms, partial solutions need to be evaluated in the context of other partial solutions to form a complete solution.The issue of estimating the fitness of partial solutions was studied by Wang and Gao [79] using fixed auxiliary functions with no dynamic metamodeling mechanism.Several studies use metamodeling as a means of reducing the overhead of matching partial solutions and their re-evaluation for cooperative coevolution [77], [80], [81] and other divide-and-conquer paradigms, such as random projections [73].
In addition to modeling of subproblems, metamodels have also been used to balance the global search (exploration) and local search (exploitation) efforts.Metamodels have been used in the competitive swarm optimizer [41] to approximate the fitness of neighboring particles of a particle with a known fitness [82].Sun et al. [83] proposed to balance the exploration and exploitation efforts by means of building local and global surrogates.For the exploration part, they use social learning PSO [53], which has good global optimization properties in conjunction with radial basis functions capable of capturing the global profile of the objective function.For exploitation, they use a variant of the fitness approximation method proposed by Sun et al. [82] in conjunction with PSO for local search.
Beside building an explicit model of the objective function, as is the case with metamodeling, approximation can be built into problem representation [84] or be achieved by means of dimensionality reduction through transformation [85]- [91] or finding intrinsic dimensions of a problem [92].Wang et al. [93] combined the benefit of metamodels and dimensionality reduction by using autoencoders to find lower dimensional features of graph embedding problems and use them to construct a surrogate model to approximate the robustness value of large-scale graph networks.Principal component analysis has also been used to identify a lower dimensional representation of the probabilistic Gaussian models of EDAs [88], and the convergence variables on multiobjective problems [89].The variable reduction strategy is another means of representing the decision variables of an objective function or its constraints based on a smaller subset of core decision variables [90], [91].The random projection theory, which was covered in part I of the survey as an implicit way of exploiting the problem structure, can be used for dimensionality reduction in the context of EDAs.The weighted optimization framework, inspired from adaptive weighting by Yang et al. [94], is another way of transforming the problem into a lower dimensional one (see Section VI-C).

IV. INITIALIZATION METHODS
The random initialization of a set of candidate solutions is at the heart of the existing metaheuristic algorithms.The aim of initialization methods is to make the best use of random number generators or other sampling techniques to cover the vast search space more uniformly.This section covers the studies on random initialization for large-scale global optimization.
A wide range of population initialization methods have been employed by evolutionary algorithms [95], [96].There are various conclusions, conflicting at times, on the effect of initialization methods on large-scale optimization [97]- [99].Kazimipour et al. [97] studied the effect of advanced initialization methods on large-scale optimization.The study suggested that EAs are more sensitive to initialization in highdimensional spaces than in lower dimensional ones, regardless of the population size.They also reported that the pseudorandom number generator is inferior to advanced initialization methods in high dimensions.A follow-up study showed that the effect of advanced initialization methods becomes marginal when the parameters of the optimizer are properly set [98].A systematic study of advanced initialization methods on a DE variant, DE/rand/1/bin [3], showed that when the parameters of the algorithm are set close to their optimal, the statistical difference between random number generators and other initialization methods becomes insignificant.
Segredo et al. [100] showed that although overall distinction between random number generators and advanced initialization methods fades away in high-dimensional spaces, there is still a significant difference between them when best case and worst case performances are taken into account.They therefore concluded that the choice of the initialization method is of crucial importance, especially when a limited number of runs are allowed.
Kazimipour et al. [99] used centered L 2 discrepancy to measure population uniformity as a function of population size and the dimensionality of the space.They reported that the loss of population uniformity (hence diversity) due to curse of dimensionality is the dominant factor in the performance degradation of optimization algorithms, regardless of the choice of the initialization method.Putting differently, it is the geometric peculiarities of high-dimensional spaces that affect all initialization methods.For example, it is well known that the contrast in the distance between randomly chosen points diminishes as the dimensionality of the space increases [101], [102], which has serious implications on various initialization/sampling techniques.Consequently, Kazimipour et al. [99] recommended the use of advanced initialization methods only when the population size and the problem dimensionality are low.

V. PARALLELIZATION
In this section, we review the algorithms that rely on CPU and GPU parallelization to improve solving large-scale global optimization problems.

A. Historical Context
Cantú-Paz and Goldberg [103] studied the scalability of parallel single-and multipopulation GAs.Their goal was to find the optimal number of processors that minimizes the runtime.Their analysis showed that the number of processors that minimizes the execution time is proportional to the square root of the population size and the objective function evaluation time.Munetomo et al. [104] also investigated the use of parallel processing for linkage learning and proposed a parallel implementation of the LINC linkage learning algorithm called pLINC (see part I of the survey).They also proposed a twolevel GA with a series of intra-GAs operating on the linkage groups identified by pLINC, and an inter-GA that operates at a higher level and treats the linkage groups as a whole.

B. Large-Scale Cases
In the context of large-scale optimization, two types of parallelization are common.The first type is specific to a particular EA, and the second type is a generic framework applicable to a wide range of EAs most of which are based on a divide-and-conquer paradigm by means of problem decomposition.
In the algorithm-specific department, Mendiburu et al. [105] proposed a parallel master-slave implementation of several binary and continuous EDAs based on a Bayesian network model using message passing interface (MPI) and POSIX threads.They parallelized the learning phase or the modelbuilding process, which often takes the maximum proportion of the execution time and tested their algorithm on 500-D binary problems and 1500-D continuous problems.A drawback of this study is the use of very simple benchmark problems, such as OneMax for the binary case and the sphere function, which is fully separable, for the continuous case.Wang et al. [22] proposed a parallelized version of DE based on GPU parallelization and tackled continuous problems of up to 1000 dimensions.They also observed that with a fixed population size the speed-up rate decreases as the dimensionality of the problem increases.Iturriaga and Nesmachnow [106] proposed a parallel version of compact GA for CPU/GPU architectures and tested it on OneMax and noisy OneMax with up to one billion variables.They also proposed an asynchronous model on GPU, which is only suitable for largescale separable problems.More recently, Duan et al. [107] proposed a spark-based software framework for parallelization of various PSO implementations.Their proposed parallel PSO showed a superlinear speedup and was tested on continuous benchmark functions with up to 10 5 dimensions as well as on expensive functions.Cao et al. [108] proposed a parallel quantum-enhanced DE by parallelizing the fitness evaluation of individuals.
Lastra et al. [109] proposed a GPU-based MA-SW-Chains [110], a memetic algorithm for large-scale global optimization (see part I of the survey), by parallelizing all major components of the algorithm, such as fitness function evaluation, crossover, local search, random number generation, and population sorting.In another study, Cano and García-Martínez [111] proposed an improved GPU-based model for MA-SW-Chain and tested it on a scaled version of the CEC'2013 large-scale benchmarking suite on functions with up to 100 million decision variables.Cano et al. [112] proposed a parallel MA-SW-Chains by adapting it to the MapReduce framework to tackle problems with up to 10 million decision variables.The local search component of the algorithm is done using a divide-and-conquer strategy by performing local search for a subset of the decision variables.
Multipopulation algorithms are common in parallelizing many metaheuristics for large-scale optimization.Ge et al. [35] proposed a multipopulation topology-based island model with a master-slave paradigm to solve large-scale problems.Wang et al. [50] proposed a distributed PSO based on randomly formed equally size subpopulations, which are co-evolved using a master-slave paradigm.Yang et al. [49] proposed a distributed swarm optimizer-based multipopulation master-slave model where the elites of each subpopulation are used in the velocity update rule.Su et al. [113] proposed a parallel multiobjective algorithm for community detection.They first identify the key nodes in the network graph and the communities associated with the key nodes are then detected in parallel using a multipopulation model.
Problem decomposition into lower dimensional subproblems is central in several recent parallelization frameworks.Among them, Cao et al. [114] proposed a distributed parallel cooperative coevolutionary algorithm for solving large-scale multiobjective problems.They used a variant of differential grouping [115] to decompose the decision space into smaller components, which are optimized in parallel using a two-level parallelization structure based on the MPI.The experimental results are based on 1000-D DTLZ and WFG test functions.De Falco et al. [80] proposed a decomposition-based parallel model for solving expensive large-scale continuous optimization problems.They use the random grouping decomposition method and build a separate surrogate (metamodel) for each component of the problem.The components are then solved in parallel using cooperative coevolution.The proposed algorithm was tested on problems with up to 1000 dimensions.Yang et al. [116] argued that decomposition-based methods, despite their modular nature, cannot be readily parallelized due to the defects in how partial solutions are evaluated.They show that the objective function used to assign a fitness to a partial solution is not consistent with the ideal fitness assignment.They address the problem of fitness assignment to partial solutions for divide-and-conquer methods by appealing to a parallel framework called naturally parallelizable divide and conquer.
Decomposition-based and multipopulation parallelization methods can also be combined to solve large-scale problems.Hybrid two-way parallelism has also been used for large-scale optimization.These often combine: 1) parallelism by means of problems decomposition, i.e., the problem is decomposed into several lower dimensional subproblems and 2) parallelism by means of population distributions, i.e., a distributed pool model [117].Jia et al. [118] proposed such a two-way algorithm that controls the resource allocation by adapting the subpopulation sizes as well as the number of iterations a particular component (subproblem) is optimized.Another two-way parallelism decomposes the problem into several components based on the variable interaction analysis.Each component is subsequently divided into subpopulations each receiving a processor for optimization.A resource allocator then prioritizes processor allocation as a function of components' contribution toward the overall improvement of the objective function.

VI. RELATED RESEARCH TOPICS
In the previous sections, we reviewed common approaches to large-scale global optimization.In this section, however, we take a problem oriented perspective and discuss several problem areas arising in the context of large-scale global optimization, such as the problem of overlapping components (Section VI-A), resource allocation and the imbalance problem (Section VI-B), decision space scalability of multiobjective optimization (Section VI-C), and the scalability of constrained optimization problems (Section VI-D).The section concludes with a discussion on benchmarking large-scale optimization algorithms and a brief review of their real-world applications (Section VI-E).

A. Overlapping Problems
The importance of the problem structure and how it can be exploited in various ways were discussed in part I of the survey.The decomposition approaches covered in part I of the series are mostly suited to partially separable problems, i.e., those with distinct independent lower dimensional components.However, there are problems with the sparse variable interaction structure, which are not partially decomposable.These problems, which we refer to as overlapping problems, occur in many application areas, such as multidisciplinary design optimization [119] and concurrent engineering [120].Multiobjective optimization problems can also be seen as overlapping problems due to shared decision variables among the objective functions [121].This is particularly the case when a scalarization technique is used to convert them to a series of single objective optimization problems.Constrained optimization problems may have overlapping interaction structures [122] or may become overlapping depending on the constraint-handling techniques used to handle them.The variable interaction structure of the overlapping problems can be discovered using the methods outlined in part I of the survey.However, exploiting the structure is a more challenging task as compared to partially separable problems.Despite the importance of overlapping problems, very few works have been dedicated to large-scale overlapping problems [78], [123], [124].This section reviews some of such techniques that can help with the scalability of algorithms for large-scale global optimization.
Some approaches rely on breaking selected interactions with the aim of converting the overlapping problems into partially separable ones [125]- [127] or by means of special crossover operators that take the overlapping nature of the problem into account [128], [129].Munetomo and Goldberg [125] used monotonicity checking to identify the variable interaction structure of the objective function and proposed a metric to measure the linkage tightness with the aim of breaking weak interactions.Yu et al. [130] used an information-theoretic interaction detection mechanism to detect the problem structure and used an entropy-based measure of a component or building block to identify and break weak interactions.Yu et al. [131] assumed that the interaction structure is given and designed a crossover operator that partitions the interaction graph into two subgraphs such that the disruption of overlapping components is minimized.Sun et al. [126] proposed a variation of recursive differential grouping [132], RDG3, which limits the dimensionality of components forcing some interactions to be broken as a consequence.Li et al. [127] proposed a decomposition method based on spectral clustering, which takes the strength of interactions into account and breaks some weak interactions such that intergroup interactions are minimized and the intragroup interactions are maximized.
Thierens [128] proposed to use hierarchical clustering to represent the interactions using a linkage tree, which is subsequently used to perform recombination.Bosman and Thierens [129] proposed an improved version of linkage tree recombination to eliminate superfluous hierarchical linkage relations.Experimental results have shown that this type of recombination performs well on overlapping problems.
In the context of cooperative coevolution, overlapping components or groups with shared decision variables is a common way of solving overlapping problems.Sun et al. [123] used monotonicity detection to identify the structure of an n-dimensional problem and form n groups, one for each decision variable containing all other variables it interacts with.These n groups, which are not necessarily mutually exclusive, are optimized in a round-robin fashion within a CC framework.Due to the nonexclusive nature of the groups, this algorithm takes, to some extent, the overlapping nature of the problem into account.Jia et al. [133] used the variable interaction analysis to identify the underlying components of the objective function and their shared variables.They then use a contribution-based mechanism to promote components with a higher contribution toward improving the objective function and assign the shared variables the components with large contributions.
Werth et al. [78] used a sliding window mechanism and optimized the variables inside the window to reduce the dimension of the problem.The problem structure is taken into account by iteratively constructing the interaction matrix of the problem using LINC-R; however, instead of finding the entire matrix, only the interactions within a given sliding window are considered at each iteration.To deal with overlap, the Cuthill-McKee algorithm is used to reduce the bandwidth of the matrix, which places interacting variables close to each other.Song et al. [124] proposed overlapped cooperative coevolution that uses a mechanism called delta disturbance to find the most influential variables and distribute them among the existing components.Although this algorithm was not intended for overlapping problems, the replication of influential variables within all components can have a positive effect on solving overlapping problems.However, this has not been verified empirically on overlapping problems.Song et al. [134] used a similar mechanism and identified important variables that can participate in multiple components to solve large-scale virtual network embedding problems.
Factored EA [135] is another framework with the capacity to decompose a problem into a set of lower dimensional subproblems.It can mimic CC as its special case and has the capacity to define overlapping components suitable for solving overlapping problems.FEA can be an effective method for solving the overlapping problem if the problem structure is known a priori.The performance of FEA remains to be checked on large-scale overlapping problems such as those proposed in the CEC'2013 large-scale benchmark suite.
The Bayesian optimization algorithm (BOA), uses Bayesian networks to represent the problem structure, which is capable of capturing overlapping components [136].Although modeling the Bayesian network is computationally expensive, its flexibility makes it a good choice for solving overlapping problems.Empirical evidence suggests that BOA performs better than tightness detection [137].Clustered EDAs are also among the implicit methods that improve the identification of problem structures as compared to canonical EDAs.Emmendorfer and Pozo [138] proposed a cluster-based EDA that uses the notion of concept-guided combination to better capture and exploit problem structure.This technique was shown to outperform models based on Bayesian networks.

B. Resource Allocation and the Imbalance Problem
The efficient use of computational resources is of significant importance in large-scale global optimization.The contribution of a decision variable or a group of decision variables, belonging to an underlying subfunction within the objective function, can have a varying degree of influence on the function output.This characteristic, which is often called the imbalance problem, can have a detrimental effect on the optimization performance if not handled properly.For example, Chuang and Chen [139] showed that the model-building process of EDAs is affected by the imbalance problem.They reported that the selected individuals based on which the probabilistic model of EDAs is updated lack the necessary information about the linkage structure of some parts of the problem.Omidvar et al. [140] also showed that the imbalance problem renders the round-robin optimization policy of cooperative coevolution suboptimal.
Although the imbalance issue has implications in many areas, such as multiobjective optimization [141], [142], constraint handling [122], and dynamic optimization [143], it has mostly been studied in the context of optimal component selection policy of cooperative coevolution.Contribution-based cooperative coevolution (CBCC) [140] is the first algorithm of this kind.To deal with the imbalance problem, CBCC and other contribution-aware algorithms first need to estimate the contribution of components, and second devise an exploration/exploitation policy to update the contribution of components and to optimize the influential components longer.The algorithms which will be reviewed in the rest of this section differ in the way they handle these two aspects.
1) Problem Decomposition and Quantifying Contributions: Problem decomposition and quantification of contributions are important prerequisites of an effective resource allocation policy.Under the black-box assumption, all we can observe is the objective value and how it changes over time.In the literature, the improvement on optimizing a subset of the decision variables (a component) can have on the objective function value is taken as a unit of improvement or contribution.This value can be quantified for an arbitrary subset of the decision variables irrespective of whether or not they belong to an underlying subfunction.For a partially separable problem, if the ideal decomposition is known, one can study the effect of a single component on the objective value by freezing all other components.However, this may not be possible for an overlapping problem (see Section VI-A), where an optimal decomposition may not be known.It is therefore apparent that problem decomposition and estimation of contributions are closely interconnected.Most algorithms define the contribution of a component to be a function of the improvement it makes on the overall objective value when it is optimized for a predetermined number of iterations while all other components are kept constant.Let δ (t) k be the improvement we get at time t when the kth component is optimized for τ iterations.Based on this definition, the two original versions of the CBCC algorithms (called CBCC1 and CBCC2) define the contribution of the kth components to be (1/T) T t=1 δ (t) k , i.e., the average of all previous improvements up to the current iteration.Due to the nonstationary nature of the underlying distribution of the contributions, some algorithms defined the contribution as a moving average over the last L iterations.CBCC3 [144] used the extreme case of L = 1, while in the case of CCFR [145] L = 2. CCFR2 [146] improves upon CCFR by averaging the improvements per function evaluations to account for unequal subpopulations.CCFR2 [146] and some other algorithms [133], [142] exponentially decay the effect of historical contributions.Ren et al. [147] defined the contribution as a function of both δ (t) k and the standard deviation across all components.Some authors suggested various normalization of δ (t) k [118], [133], [142], [148], [149] as the contribution of a component.
In addition to δ (t) k defined previously, other measures of contributions have also been proposed.Global sensitivity analysis techniques, such as Morris screening is used in several works as the contribution measure for individual variables as well as components [150]- [152].Delta disturbance is a perturbation method proposed to measure the contribution of individual variables and finding the most influential variables for further optimization.Using the plain fitness of a component as the contribution of a component has also been suggested [153].
2) Resource Allocation Policies: A simple allocation policy, the variations of which are used by many algorithms, is to complement the round-robin policy of canonical CC by one or more iterations of optimizing the highest contributing component (as measured by the methods outlined in Section VI-B1 and Table I).CBCC1 [140] is the most conservative approaches in which the round robin is followed by only one episode of optimizing the highest contributing component (also employed by SACC1 [152].CBCC2 [140] is greedier and exploits the highest contributing component until no improvement is observed (also employed by SACC2 [152]).This is shown to be an unstable policy because the contribution of initially best component may not remain the best for the rest of a run.CBCC3 [144] addresses this issue by optimizing the highest contributing component until its contribution drops below the second best.This has the effect of equalizing contributions.CBCC3 also randomly enters an exploration phase where all components get a chance of updating their contributions.CCFR [145] and CCFR2 [146] use a similar equalization strategy and includes a stagnation detection mechanism to avoid optimizing stagnant components.FCRACC [147] is similar to CBCC3 in which it exploits the best component, but its method of quantifying contributions differs (see Section VI-B1).Meselhi et al. [149] used round robin followed by a fuzzy rule-based allocation based on the contribution and population diversity.Shen et al. [142] used a roulette-wheel selection mechanism based on the contributions.Jia et al. [133] optimized all the components whose contribution is more than half the best contribution.Although not specifically designed to address the imbalance problem, overlapped CC [124] replicates influential decision variables in more than one component, which has the effect of optimizing influential variables more often.
In addition to the heuristics described above, some algorithms define the resource allocation policy as a function of the estimated contributions.SACC3 [152] determines the number of times a component is optimized (τ ) as a function of its effect as measured by Morris screening.CCAOI [148] normalizes the contributions according to the Gini index and allocates the computational resources accordingly.DCCA [118] uses a distributed model and allocates more CPU instances to better contributing components.The population size and τ are then defined to be functions of the number of CPUs assigned to a component.Bandit-based CC (BBCC) [154] uses multiarm bandit approaches, such as -greedy, SoftMax, or upper confidence bound (UCB) to select the components based on their contributions aiming at balancing exploration and exploitation in a more systematic way.
3) Component Selection as Multiarmed Bandit Problem: Despite being effective in outperforming the canonical CC, the contribution-aware algorithms discussed so far are based on a set of heuristics derived form empirical observations with minimal theoretical basis.Some authors proposed that the component selection policy of CC can be treated as a multiarmed bandit (MAB) problem [154], [155].Among them, Kazimipour et al. [154] defined and mapped the building blocks of a contribution-aware CC into a MAB framework.This framework, BBCC, has the flexibility to mimic the previously described algorithms as a special case.BBCC treats the contribution of a component as the utility or the value function in reinforcement learning.This estimated contribution or long-term utility is defined to be a function of immediate improvements or rewards measured by quantities, such as δ (t) k .Given this interpretation, a wide range of contribution estimators, such as moving average (simple, weighted, or exponential), rank-based, or hybrid estimators can be used.The component selection policy is also responsible for balancing between exploration and exploitation, which can be done using a wide range of existing selectors, such as -greedy, -first, GreedyMix, LeastTaken, SoftMax, UCB, and other similar selectors widely used in the multiarm bandit and reinforcement learning literature.The simplest instance of BBCC with a normalized δ (t) k , and simple averaging as the contribution estimator, and -greedy has shown to outperform all CBCC family of algorithms, as well as CCFR and MOFBVE on the CEC'2013 large-scale benchmark suite.

C. Multiobjective Optimization
Multiobjective optimization problems are prevalent in a wide range of application areas [156].As the name implies, these problems have two or more conflicting objectives causing them to have multiple tradeoff solutions known as the Pareto-optimal solutions.The scalability of multiobjective problems can be studied in either the objective space [157] or the decision space.The former refers to the effect of the number of objective functions on the performance of the algorithms [157], while the latter is concerned with the scalability of each objective function with respect to its number of decision variables.In this section, we address the scalability of multiobjective problems in the decision space, which has attracted attention in the last decade [158]- [160].
Large-scale multiobjective approaches can be seen in four major categories.
1) Problem decomposition where the aim is to break the problem into a set of lower dimensional subproblems in the decision space.2) Problem transformation where the aim is to reformulate the original problem into a single lower dimensional problem.3) Operator design where the aim is to devise more efficient solution generation mechanisms.4) Problem reduction and intrinsic dimensions where the aim is to find a lower dimensional latent space embedded in a higher dimensional sparse space.Problem decomposition approaches, not to be confused with scalarization techniques, such as Tchebychev [161], operate in the decision space and aim at forming a set of smaller and more manageable subproblems.Such decompositions are often done by considering the interaction structure of the decision variables [115] or by grouping the variables based on their effect in the objective space, i.e., those pertaining to the convergence of solutions toward the Pareto-optimal front, and those pertaining to diversity of the solutions on the Pareto-optimal front [162], [163].Problem decomposition by means of variable interaction detection is a challenging task in large-scale multiobjective optimization for the following major reasons.
1) The lack of consistent partially separable grouping across all objectives.
2) The effect of problem formulation on variable interaction [164].For example, the use of scalarization techniques may result in an sparse but overlapping interaction structure due to shared decision variables among several objectives (see Section VI-A for more information about overlapping problems).
3) The cost of variable interaction detection for several objectives.Cooperative coevolution has been used with several decomposition strategies, such as random grouping [165], [166], multilevel random grouping [167], differential grouping [114], [168], and monotonicity detection [57], [162] to solve large-scale multiobjective problems.In some cases, several variable decomposition methods are combined [142] or completely new ones are proposed [169].For instance, Ma et al. [170] proposed to use a convergence relevance degree to decompose the problem and Wang et al. [169] used a tensor canonical polyadic decomposition to divide the d-order tensor of decision variables into a set of uncorrelated components.Some other decomposition techniques [114], [162], [163], [168], [171] divide the decision variables based on their dominant role in the optimization problems: convergence of solutions on the Pareto-optimal front and diversity of solution on the Pareto-optimal front.In such techniques, it is customary to further divide the convergence matrix using variable interaction methods described before.
An alternative to problem decomposition is to transform the original large-scale problem into a single lowdimensional problem.This transformation can be implemented using well-known methods such as PCA to obtain a lower dimensional representation of the convergence variables [89], regression analysis to represent a subset of variables as a function another subset [172], or other reformulation techniques [86], [173].One such reformulation, which has gained attraction in recent years and has shown to outperform some state-of-the-art methods [174], is the weighted optimization framework (WOF) [86], [175].Inspired by the notion of adaptive weighting [94], WOF reduces the dimensionality of the decision space by forming several groups of the decision variables, often using some variable grouping method [176], and transforming them using a given transformation function parameterized by a weight vector.Problem reduction is another approach where the algorithm attempts to find the intrinsic dimensionality of the problem, which is often substantially lower than its nominal dimensions.Sparse multiobjective problems arise in many practical application areas where the decision value of many Pareto-optimal solutions is zero.The essence of these algorithms is to find the sparse distribution of the decision variables using techniques, such as autoencoders [177] or pattern mining [178], and devising mechanisms capable of taking the sparsity into account [179], [180].
The previous approaches stated above are all based on operating in a lower dimensional space.There are also a range of algorithm-specific ways traversing the search space more effectively.For instance, Yi et al. [181] studied and improved the crossover operator of NSGA-III [182].Some studies proposed new update strategies for PSO to improve its convergence and diversity properties for largescale multiobjective problems [42], [57], [183].Similarly, quantum-enhanced DE [108] and variable-importance-based DE [141] have also been proposed to tackle high-dimensional multiobjective problems.Designing better local and neighborhood search mechanisms [113], [184], [185] and better solution selection mechanisms [186], [187] are other ways of improving the overall search efficiency.
Multiobjective algorithms can be categorized into dominance-based, decomposition-based, or indicator-based approaches based on how they handle the multiplicity of objective functions.Table II summarizes the large-scale multiobjective algorithms based on these three approaches.Fig. 4 shows the percentage of each approach used in Fig. 4. Proportion of papers according to Table II using different approaches to multiobjective optimization.the large-scale multiobjective literature.As can be seen, dominance-and decomposition-based approaches are the most dominant and indicator-based approaches are the least explored.Here, generic refers to the frameworks which are neutral to the choice of the optimizer.

D. Constraint Handling
Constraints are indispensable part of many real-world optimization problems [195].As a result, a wide range of constraint-handling methods has been proposed for evolutionary algorithms and nature-inspired metaheuristics [196].Despite the plethora of constraint-handling techniques, they suffer from the curse of dimensionality and very few studies have been dedicated to the topic of scalability in constrained optimization.Constraint-handling methods have the following scalability challenges.
1) High-dimensional objective and/or constraint functions.
2) The dependence between the number of constraints and the number of decision variables.For a largescale problem, this may result in a highly constrained problem.
3) The complex problem structure due to shared decision variables among the objective function and the constraints.4) The effect of constraint-handling method on the problem structure and variable interaction.For example, a simple penalty method can convert a partially separable problem into an overlapping one (see Section VI-A).Constraint-handling techniques used in large-scale global optimization are mostly based on problem decomposition and variable interaction analysis.Accurate decomposition has been shown to have a significant impact on reducing the number of constraint violations [197].Fitness difference minimization [198] and the differential grouping family [115] are two of the most widely used decomposition methods in large-scale-constrained optimization.
Sayed et al. [198] used a fitness difference minimization approach to analyze the interaction structure of the objective and constraint functions and decompose them into a set of smaller subproblems.Aguilar-Justo and Mezura-Montes [199] improved upon [198] and used an aggregate function of all constraint violations, rather than the individual constraints, for interaction analysis and problem decomposition.A problem of fitness difference minimization methods is the need for specifying the number or the size of components.To fix this, Aguilar-Justo et al. [200] proposed to evolve the best arrangement of the decision variables as well as the number of components using GA to solve large-scale-constrained problems.
Finite-difference decomposition methods (part I Section II-B) are generally more accurate than fitness difference minimization.A study shows that one such algorithm, differential grouping version 2 [115], is more robust and, therefore, better suited to complex highly constrained problems [201].Blanchard et al. [202] used a variant of differential grouping, IDG [203], to form an interaction structure matrix for the objective function and the constraints.The aggregate interaction matrix of interactions is then used to decompose the objective as well as the constraints.Xu et al. [122] also proposed a coevolutionary algorithm based on differential grouping and used a contribution-based mechanism to allocate resources to components based on their contributions and degree of constraint violations.
In addition to problem decomposition, other approaches, such as surrogate modeling [72], memetic algorithms [201], offspring generation [204], and boundary constraints violations [69], [205] are also studied in the context of large-scaleconstrained optimization.Approximation and variable reduction techniques also appear to be promising approaches for solving large-scale-constrained problems [90], [91] requiring further investigation.

E. Benchmarks and Applications
Although the ultimate goal of designing efficient algorithms is to solve real-world problems, their sheer complexity due to the entanglement of various aspects, such as constraint handling and dealing with mixed variable types, limits one's ability to conduct a focused study of a particular aspect of a problem common to a wider range of problems.Benchmark problems address this issue by capturing practical aspects, such as dimensionality, modality, structure, constraints, variable types, noise, etc., into a set of tunable well-defined functions [206].In the context of large-scale global optimization, the IEEE CEC large-scale global optimization benchmark suites [207]- [209] are the most widely used.All the benchmarks are based on a set of base functions popular in numerical optimization [210], such as Rastrigin, Rosenbrock, Ackley, Schwefel, Sphere, Elliptical, Griewank, and many more.
The CEC'2008 large-scale suite [207] contains a set of seven functions, which is tested in 100, 500, and 1000-D dimensions.This is a very small set and lacks modularity, i.e., systematic control over how decision variables are linked.The CEC'2005 suite, though not labeled as "large scale" is scalable but used mostly in low-dimensional (≈30-D) analyses.It introduces composite functions 1 through the weighted sum of a series of base functions, which can potentially cause partial interaction between decision variables.However, this is not implemented in a systematic way to give a full control over the problem structure.Herrera et al. [19] used a similar approach to propose a set of 19 functions used in the special issue of Soft Computing Journal on the scalability of evolutionary algorithms and other metaheuristics for large-scale continuous optimization problems [212].
To facilitate the study of variable interaction and decomposition-based algorithms, the CEC'2010 large-scale suite [208] introduced modularity into the benchmarks, where the number of component functions and their participating decision variables are known and controllable.The benchmark contains a set of 20 1000-D functions in three categories: 1) fully separable; 2) two types of partially separable functions with and without a fully separable component; and 3) fully nonseparable.The CEC'2013 large-scale suite [209] addressed the shortcomings of its predecessor by introducing nonuniform component sizes, imbalance among the contributions of components, and overlapping functions the components of which may not be disjoint.It also includes transformations, such as symmetry breaking, ill conditioning, and local irregularities proposed in the black-box optimization benchmark (BBOB) suite [213].Sun et al. [214] extended the CEC'2013 suite to make it more tunable.Other studies which paid attention to problems with overlapping components were conducted by Werth et al. [78] and Sayed et al. [215].
There are other scalable benchmarks that have been used to study various other aspects of large-scale optimization algorithms.COPS 3.0 represents a repertoire of scalable and constrained problems found in various areas of engineering and sciences.Goh et al. [216] used the EEG big data optimization problem and proposed a BigOpt2015 benchmark suite for large-scale multiobjective optimization.Cheng et al. [217] adopted the ideas proposed in [208] and [218] to propose a set of large-scale multi-and many-objective benchmark functions.Scalable benchmark functions for constrained optimization problems are very limited.Beside the constrained problems compiled by Dolan et al. [219], which are not widely used in the EC community, Sayed et al. [198], [215] also proposed a set of artificial benchmark functions for constrained problems.Recent studies also attempted to extend standard dynamic optimization benchmarks to study large-scale dynamic optimization problems [143], [220].
Applications: Benchmarks are used to ultimately help with designing and evaluating efficient algorithms for solving real-world problems.Some studies directly used real-world problem instances, in addition to artificial benchmarks, to compare algorithms.Table S-III of the supplementary material shows a set of high-dimensional optimization problems in a wide range of application areas.The problem types include real valued, integer/binary, mixed integer, and combinatorial, and about half of these include constraints.Among the approaches, divide-and-conquer methods, such as problem decomposition and coevolutionary algorithms are the most popular followed by hybrid methods (memetic algorithm, local search hybridization, and ensembles), which is consistent with our observations in part I of this survey series.Other approaches include parallelization, approximation and encoding schemes, and algorithm-specific sampling and variation operator design.Curse of dimensionality is the predominant source of difficulty among these application areas and the existence of other factors, such as constraints, noisy, and nonsmooth objective functions add to the complexity.For the continuous decomposition-based approaches, noise and nonsmooth functions appear to be an obstacle to an accurate variable interaction analysis.In the case of combinatorial problems, finding an effective decomposition is a major challenge in its own right.A challenge for approximation and encoding schemes is model resolution or granularity which mediates between the accuracy of the model and its complexity.Among the constrained problems, handling of infeasible solutions and the interplay between dimensionality and the number of constraints are two of the most important challenges.

Large-Scale Global Optimization Competitions:
In this section, we review the results of large-scale global competitions since 2008 when the first IEEE CEC competition on largescale global optimization was held.Table III lists the winners and runner ups.Based on the approaches to large-scale global optimization we outlined in this article, high performing algorithms almost exclusively belong to either memetic algorithms (part I Section III) or decomposition-based algorithms (part I Section II) with memetic algorithms and local search dominating the competition.
It is interesting to note that the algorithmic philosophy of these two approaches is orthogonal and complementary to each other.The premise of memetic algorithms is to balance exploration and exploitation by means of combining global and local search operators.Whereas the main premise of decomposition methods is interaction-aware space partitioning and dimensionality reduction.Consequently, memetic algorithms lack an intrinsic mechanism for dealing with variable interactions and systematic space partitioning, and decomposition methods lack an intrinsic mechanism for balancing exploration and exploitation.It is therefore not surprising to see that the effort of hybrid algorithms is focused on proposing novel ways of balancing exploration and exploitation, and the effort of decomposition methods is centered around finding accurate interaction detection principles and effective grouping.This leaves both approaches with major blind spots.The absence of interaction detection mechanism in hybrid frameworks puts an extra burden on the exploration process by searching the regions which could have been avoided through partitioning, and reduces the efficiency of dimensionwise local search due to ignored interactions.Most decomposition-based methods also discount the role of the component optimizer in balancing exploration and exploitation.This deficiency is partially compensated by contribution-aware algorithms (see Section VI-B) which balance exploration and exploitation at the component level.
The design biases of hybrid and decomposition-based methods stated above can also be seen among the competition winners and runner ups listed in Table III.MTS [221] uses orthogonal arrays combined with a mixture of local operators.Orthogonal arrays give an expansive initial coverage of the search space, which forms the basis for the subsequent local search process.MA-SW-Chains [110], IHDELS [226], and SHADEILS [227] combine a global search algorithm with a chain of iterated local search attempts to balance exploration and exploitation.MOS [224], [225], despite being a framework capable of hybridizing any set of operators, uses a mixture of global and local operators to control exploration and exploitation.Even the algorithms, such as MPS [229] and LSEDA-gl [222], which are not memetic by definition, have explicit elements of balancing the exploration and exploitation forces.MPS [229], for instance, proposes a mechanism to disentangle the exploration and exploitation mechanisms with the aim of minimizing failed and deceptive exploration attempts and maximize the successful ones.LSEDA-gl [222] also hybridizes heavy-tailed distributions such as Lévy to promote exploration with the classic Gaussian distribution to promote exploitation.
Among the competition winners and runner ups listed in Table III, LSHADE-SPA [228] and EOEA [230] are the only hybrid algorithms that consider both exploration/exploitation balance as well as problem decomposition.On the decomposition side, however, they both fall short of using an effective variable interaction detection method.LSHADE-SPA [228] uses the outdated random grouping [94] known to perform poorly on partially separable problems (see Section II of part I) and EOEA [230] uses a grouping strategy incapable of an effective space partitioning.Due to their emphasis on accurate interaction analysis, decomposition methods are historically not systematic with their choice of component optimizer, which limits their ability in maintaining good exploration/exploitation balance within the lower dimensional subspaces.This is perhaps why hybrid algorithms are dominant in competitions [231], [232].As a matter of fact, decompositionbased algorithms are capable of using any component optimizer, including memetic algorithms and other hybrids, to further improve their scalability.This is why the combination of accurate problem decomposition (RDG3 [126]) and good component optimizer (CMA-ES) resulted in superior performance by CC-RDG3 in 2019.Further evidence also suggests that several decomposition-based algorithms not present in competitions can outperform competition winners.For example, the decomposition-based algorithm proposed by Mei et al. [233] outperformed MA-SW-Chains [110], the winner of CEC'2010 competition on both CEC'2008 can CEC'2010 LSGO benchmark suites, and recursive differential grouping [132] outperformed MA-SW-Chains [110] and MOS [225] on the CEC'2010 and CEC'2013 LSGO benchmark suites.It is often stated that the cost of decomposition prohibits their use in large-scale settings.However, recent advances in variable interaction and grouping methods allows this to be achieved in O(n log n) in the general case and O(n) on separable functions (see part I Section II-B for more details).

VII. CONCLUDING REMARKS
In the two parts of this survey, we reviewed a wide range of population-based metaheuristics for large-scale global optimization in six major categories: 1) problem decomposition; 2) hybridization and memetic algorithms; 3) sampling and variation operators; 4) approximation and surrogate modeling; 5) initialization methods; and 6) parallelization.We reported on the state of the art and what has been achieved over the last decade.In this section, we change perspective and try to touch upon two major issues pertaining to the future of the field.
1) Where do we stand as a field and what are the potential pitfalls and challenges hindering the progress of the field?2) Where to go next?What are the pressing open questions and where more focus is needed?

A. Large-Scale Global Optimization: Pitfalls and Challenges
Big Picture: Despite the advances in various areas of largescale global optimization, sometimes their relation to the bigger picture is unclear.This is partly due to the lack of a clear measure of the progress in the field and lack of clarity about its grand challenges.The bulk of the research in the field is currently driven by showing statistically significant improvements over existing results with minimal reference to whether these so-called significant results are actually meaningful in real-world settings.There is also an overemphasis of the success stories, rather than giving insights into why an algorithm fails on a particular problem or a class of problems.This is partly due to the departure from the scientific method in conducting research in favor of an engineering approach where a comparison with the state of the art is encouraged.
The lack of a big picture may cause the field to focus too much on nice-to-have incremental research rather than addressing the core issues of large-scale global optimization.This deficiency currently manifests itself in at least three forms.
1) Emergence of New "Metaphor"-Based Algorithms: A wide array of metaphor-based metaheuristics have been proposed in recent years.These "novel" algorithms are often a marginal variation of an existing algorithm under the disguise of new terminology.In large-scale global optimization, some works claim novelty by simply applying one such new metaheuristic to solve some standard large-scale benchmark suite.This is very detrimental to the field and "take the field of metaheuristics a step backward rather than forward" [234].

2) Ad Hoc Improvements of Algorithms With Marginal
Scientific or Practical Significance: This type of work often present a minor variation of an existing algorithm, which statistically improves upon the previous results despite the magnitude of the difference being negligible for any practical purpose.As an example, applying a known parameter adaptation technique to dynamically control the parameters of a new metaheuristic algorithm falls short of addressing major challenges of the field.3) Theory-Practice Gap: Given the ever growing need for scalable optimization algorithms in a wide range of application areas, the gap between theory and practice in terms of the problem sizes currently being tackled is widening.In other words, the large-scale problems being studied now using the standard benchmarks are far from the large-scale problems faced in practice.To avoid falling prey to these defects, we need to check where we stand as a community in relation to identifying and addressing the grand challenges of the field and bridging the gap between theory and practice.This perhaps requires a separate quantitative in-depth investigation of the large body of reported results, which is outside the scope of this article.
Comparison: Despite the availability of relatively standardized and widely used benchmark suites, it is still hard to compare the reported results across a wider body of publications to be able to see the major patterns and trends.Despite some attempts to develop automated comparison tools such as the toolkit for automatic comparison of optimizers (TACOs), 2  we currently do not know the answer to questions such as the following: given a specific function or family of functions, which algorithm or class of algorithms performs the best and why?In continuous problems, especially due to the imbalance effect, the overall solution quality may seem poor, but the solution may indeed be close to the global optimum.Based on the reported results, we currently do not know how far the solutions are from the global optimum.
The large-scale global optimization competitions also acted as a venue to compare a wider range of algorithms, but their conclusions remain limited due to the absence of several stateof-the-art algorithms from the competitions.We do not know to what extent does the competition outcomes depend on the allotted number of objective function evaluations.Do the conclusions change if the algorithms are given less or more resources?
Answering some of the questions stated above can help in finding the recurring issues and bottlenecks and help with shaping the big picture and identifying the core issues of the field.
Adoption: Lack of streamlined and easy-to-use software packages make the adoption of the recent developments very difficult for practitioners or other researchers outside the field.The most recent algorithms are often sophisticated and hard to implement which is a stumbling block in the way of their wider adoption.have shown competitive results on high-dimensional learning problems [235], [236], research on devising populationbased algorithms to tackle large-scale learning problems is scarce [237]- [242].Population-based metaheuristics in general, and evolutionary algorithms in particular, are suited for environments that require hard exploration.As a result, they can be competitive in areas, such as neural architecture search [243], training of deep neural networks [236], and reinforcement learning [244].In reinforcement learning, for instance, evolution strategies have shown to perform better than the policy gradient on Atari 2600 games.These methods are particularly suitable when the effective number of time steps is long, the actions have long-lasting effects, and no good value function estimator is available [244].

B. Potential Areas for
Conversely, machine learning algorithms can be used in conjunction with large-scale global optimization algorithms to improve their scalability.For example, machine learning can be used as a general approach to learn and discover a problem's structural information from available data [245].The learned model can then be applied to unseen data for the purpose of classification or time-series prediction.An optimization process such as branch and bound can be modeled as a decision making process; hence, a machine learning model can be applied, to learn the most efficient and effective way [246].This will go a long way in helping an algorithm's ability to scale to higher dimensional problems.Similarly, machine learning algorithms can be used to reduce the size of an optimization problem before being tackled.For examples, some recent work on employing machine learning techniques to learn from known instances that contain optimal solutions in order to reduce the problem first, without losing the optimal solutions [247], [248].
2) Synergy Between Metaheuristics and Classic Mathematical Programming: Several classic derivativefree optimization algorithms have been successfully used as local search operators in the context of memetic algorithms (c.f.part I Section III).Other promising areas of research at the intersection of population-based optimization algorithms and classic mathematical programming are as follows.
1) Problem Decomposition: In classic mathematical programming, there exist some decomposition methods, such as column generation, Bender's cut, and Dantzig-Wolfe decomposition [249].These techniques can be effective under certain assumptions such convexity or linearity.An important question to ask is how one would combine the merits of metaheuristics with these decomposition methods to tackle real-world LSGO problems that are often nonconvex and nonlinear?As an example, machine learning has been used to learn when Dantzig-Wolf decomposition is effective on mixed-integer linear programming problems [250].2) Variation Operators: Many large-scale real-world optimization problems are combinatorial by nature, e.g., either discrete, binary, or mixed types.Examples include large-scale traveling salesman problems or other graphbased optimization problems.Designing metaheuristic variation operators (such as crossover and mutation) in order to produce new solutions from the existing ones can be a significant challenge.Most existing successful methods are conventional mathematical programming methods, such as branch-and-bound and branch-and-cut methods.Hybrid methods that combine the merits of metaheuristics and exact methods (e.g., taking advantage of the "shared information" in a metaheuristic population) are a promising direction [251]- [254].3) Exploiting Problem Structure: Exploiting the problem structure and gray-box optimization has shown to be effective ways of solving large-scale problems (part I Section II).These structural information can be used in the form of explicit decomposition or implicitly through model building.The challenge of explicit methods is the cost of offline variable interaction learning, which requires objective function evaluations and causes an overhead on the overall optimization cost.Another issue is that a crisp decomposition is sometimes impractical due to various forms of couplings caused by the existence of multiple objectives, overlapping components (shared variables among subfunctions), or coupling through constraints.Implicit methods also suffer from the accuracy of capturing the problem structure, especially when the problem size grows in size.Finding more efficient and effective ways of exploiting structural information, such as overlap, can have a significant impact on improving the scalability of optimization algorithms.
4) Noise, Dynamism, and Uncertainty: The scalability of optimization algorithms in the presence of noise, dynamical changes of the landscape, and uncertainty has scarcely been studied with only few papers addressing these issues [143], [220].These problem types pose a range of new challenges to the existing approaches to largescale optimization presented before.For example, variable interaction analysis methods are designed based on the assumption that the objective function is noiseless.The dynamical changes of the landscape can change the structural properties of the objective function, which makes it difficult for the explicit and implicit methods to exploit these information in an efficient manner.
5) Constraint Handling: Constraints are indispensable part of most real-world optimization problems; however, very limited studies have been dedicated to the effect of problem dimensionality on constraints [198], [202] (also see Section VI-D).There is still a lack of efficient constraint handling tools to cope with high-dimensional constraint functions or the cases where the number of constraints is a function of the dimensionality of the objective function [255], [256].In the later case, the problem may contain a large number of lowdimensional constraints.The field currently lacks scalable and controllable constrained benchmark problems either synthetic or based on real-world problems [195].

Fig. 1 .
Fig. 1. Outline of the topics covered in the two parts of this survey series on large-scale global optimization.

Fig. 2 .
Fig. 2. DE and PSO are two widely used metaheuristic algorithms used in large-scale global optimization.This figure shows major aspects of these algorithms studied for large-scale optimization problems.

Future Research 1 )
Synergy Between Optimization and Learning: Deep learning problems are in essence high-dimensional problems with the potential to contain millions or billions of decision variables.Although evolutionary algorithms 2 https://tacolab.org/

TABLE I LIST
OF CONTRIBUTION-AWARE ALGORITHMS WITH A SHORT DESCRIPTION OF THEIR METHOD OF ESTIMATING THE CONTRIBUTION OF COMPONENTS AS WELL AS THEIR RESOURCE ALLOCATION POLICY

TABLE II CATEGORIZATION
OF LARGE-SCALE ALGORITHMS WITH RESPECT TO THEIR STRATEGY OF HANDLING MULTIPLE OBJECTIVES

TABLE III WINNING
AND RUNNER-UP ALGORITHMS OF LARGE-SCALE GLOBAL OPTIMIZATION COMPETITIONS SINCE 2008