Dual Differential Grouping: A More General Decomposition Method for Large-Scale Optimization

Cooperative coevolution (CC) algorithms based on variable decomposition methods are efficient in solving large-scale optimization problems (LSOPs). However, many decomposition methods, such as the differential grouping (DG) method and its variants, are based on the theorem of additively separable functions, which may not work well on problems that are not additively separable, resulting in a bottleneck for CC in solving various LSOPs. This deficiency motivates us to study how the decomposition method can decompose more kinds of separable functions, such as the multiplicatively separable function, to improve the general problem-solving ability of CC on LSOPs. With this concern, this article makes the first attempt to decompose multiplicatively separable functions and proposes a novel method called dual DG (DDG) for better LSOP decomposition and optimization. The novelty and advantage of DDG are that it is suitable for not only additively separable functions but also multiplicatively separable functions, which can considerably expand the application scope of CC. In this article, we first define the multiplicatively separable function, and then mathematically show its relationship to the additively separable function and how they can be transformed into each other. Based on this, the DDG can use two kinds of differences to detect the separable structure of both additively and multiplicatively separable functions. In addition, the time complexity of DDG is analyzed and a DDG-based CC algorithm framework is developed for solving LSOPs. To verify the superiority of DDG, experiments and comparisons with some state-of-the-art and champion algorithms are conducted not only on 30 LSOPs based on the test suite of the IEEE CEC large-scale global optimization competition, but also on a case study of the parameter optimization for a neural network-based application.


I. INTRODUCTION
Large-scale optimization problems (LSOPs), which are becoming increasingly ubiquitous in the research community and real-world applications, have attracted increasing attention in recent years [1]-[3]. Due to the "curse of dimensionality," a large number of decision variables make the landscape of LSOPs highly complex and very difficult to optimize [4]-[7]. As evolutionary computation (EC) algorithms are efficient tools for solving various complex optimization problems [8]-[10], such as multimodal [11], [12]; multi-/many-objective [13], [14]; and expensive optimization problems [15], [16], many researchers have also studied powerful EC-based algorithms for solving LSOPs [17]-[19]. In this direction, the cooperative coevolution (CC) framework has achieved great success and, therefore, has been widely studied in recent years [20]-[22].
Inspired by the "divide-and-conquer" mechanism, the core idea of the CC framework is to decompose the LSOP into several nonoverlapped subproblems with lower dimensions and then optimize each subproblem using EC algorithms [23]- [26]. To better describe the idea of CC, Fig. 1 presents a general CC framework. As shown in Fig. 1, the CC framework mainly has three stages: 1) the decomposition stage, where the problem will be decomposed into several subproblems by the decomposition method; 2) the optimization stage, where the subproblems will be optimized by EC algorithms; and 3) the combination stage, where the subsolutions for corresponding subproblems will be combined to form the complete solution. As decomposition is the essential and critical stage in CC, the quality of decomposition can greatly influence the optimization results.
Therefore, to better achieve problem decomposition, many decomposition methods have been proposed and researched [27]- [30]. Generally speaking, existing decomposition methods can be roughly classified into two categories: static and dynamic decomposition methods [31]- [33]. Intuitively, static decomposition methods attempt to obtain satisfactory decomposition results before the optimization stage and keep the decompositions fixed during the whole optimization process [34], [35], while dynamic methods decompose the problem dynamically according to the information obtained during the optimization process [36]- [38]. Among the existing methods, differential grouping (DG) [27] and its variants have obtained great success in decomposing LSOPs, especially additively separable LSOPs. Therefore, many enhanced DG methods have been proposed in the literature [28]- [30].
However, as many existing DG-based methods are based on the theorem of additively separable functions, their decomposition abilities will deteriorate if the problem is not additively separable. For example, (1) presents the objective function of a minimization problem as

min_{x_1, x_2} g_example(x_1, x_2) = 2x_1x_2 + 5x_1 + 14x_2 + 35, x_1 ∈ [−5, 5], x_2 ∈ [−2, 2]    (1)

where we know that (1) can be decomposed into two subproblems of x_1 and x_2 for independent optimization because the optimal value of x_1 is exactly a constant (i.e., −5), regardless of the value of x_2, and the optimal value of x_2 is also a constant (i.e., −2), regardless of the value of x_1. Therefore, instead of optimizing the whole (1), we can separate variables x_1 and x_2 and then optimize the subproblems g_1 and g_2 independently as

min_{x_1} g_1(x_1) = g_example(x_1, x_2)    (2)

min_{x_2} g_2(x_2) = g_example(x_1, x_2)    (3)

where in (2), x_1 is the variable while x_2 is considered a function parameter with a fixed value belonging to [−2, 2]. Similarly, (3) is a function of x_2, while x_1 is considered a function parameter with a fixed value belonging to [−5, 5]. Therefore, after the decomposition, (2) is easy to optimize with x_1 = −5 as the optimum, and (3) is also easy to optimize with x_2 = −2 as the optimum. However, the separable problem in (1) cannot be correctly decomposed by existing DG methods because the term 2x_1x_2 is not additively separable [27]. That is, existing DG-based methods will consider x_1 and x_2 as interacting and nonseparable and, therefore, group them into the same group, which is not suitable. The key issue behind this phenomenon is that the additively separable structure is not the only separable structure. In other words, even though a separable problem is not additively separable, it may be decomposed in other ways. Therefore, exploring a more general decomposition method is a promising direction for further improving the problem-solving ability of CC on LSOPs. In fact, in many real-world applications, optimization problems can include variables with multiplicative interactions. For example, neural networks (NNs), including deep NNs and convolutional NNs [39], have become essential in various applications nowadays [40], where the NNs usually have a large number of parameters to be optimized to obtain better performance [41]. However, the large-scale and complex parameter optimization problem of NNs can contain many multiplicatively separable variables because the input of each layer will be multiplied by the parameters of the current layer to generate the input of the next layer. Therefore, besides the additively separable structure, it is more promising to consider the decomposition of more kinds of separable structures, including the multiplicatively separable structure.
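For illustration, the following Python sketch (ours, not part of the original problem definition; the fixed values 0.7 and −1.3 are arbitrary) numerically confirms the separability of (1): optimizing x_1 and x_2 independently, each with the other variable fixed, recovers the same optimum as a joint search.

```python
import numpy as np

def g_example(x1, x2):
    return 2 * x1 * x2 + 5 * x1 + 14 * x2 + 35

x1_grid = np.linspace(-5, 5, 1001)   # search range of x1
x2_grid = np.linspace(-2, 2, 401)    # search range of x2

# Optimize each variable independently, with the other one fixed arbitrarily.
best_x1 = x1_grid[np.argmin(g_example(x1_grid, 0.7))]   # any fixed x2 gives x1 = -5
best_x2 = x2_grid[np.argmin(g_example(-1.3, x2_grid))]  # any fixed x1 gives x2 = -2

# Joint grid search over both variables for reference.
X1, X2 = np.meshgrid(x1_grid, x2_grid)
idx = np.unravel_index(np.argmin(g_example(X1, X2)), X1.shape)

print(best_x1, best_x2)    # -5.0 -2.0
print(X1[idx], X2[idx])    # -5.0 -2.0, i.e., the same optimum as the independent searches
```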
With the above concerns, this article proposes a novel dual DG (DDG) method to achieve a more general and better problem decomposition. Compared to the existing DG-based methods, the advantage of DDG is that it can be suitable for not only additively separable problems but also multiplicatively separable problems, which can greatly expand the application scope of CC from single separable problems (i.e., additively) to dual separable problems (i.e., both additively and/or multiplicatively). For example, the problem given in (1) is multiplicatively separable and, therefore, it can be decomposed by the DDG, which will be described later as an example in Section III-A. Moreover, in Section III-A, we will first give the definition of a multiplicatively separable function and then mathematically show the relationship between the additively and multiplicatively separable functions and how they can be transformed into each other. Based on this, the details of DDG will be provided. After that, we will further develop a DDG-based CC algorithm for solving LSOPs and theoretically analyze the time complexity of DDG.
The major novelties and contributions of this article are summarized as follows.
1) The definition of the multiplicatively separable function and its relationship with the additively separable function are mathematically provided in this article. More importantly, two related theorems are then given and proved, which can guide the detection of the separable structure in multiplicatively separable problems.

2) Based on the given theorems, the DDG is proposed to achieve a more general decomposition ability for LSOPs. By utilizing two different kinds of differences to detect separable structures, the DDG can efficiently decompose not only additively separable problems but also multiplicatively separable problems. Moreover, the time complexity of DDG is also analyzed and given in this article.

3) A DDG-based CC algorithm for solving LSOPs is further developed by combining the DDG with a classical and widely used CC framework.
To evaluate the proposed DDG and the DDG-based algorithms, experimental studies are conducted on 30 LSOPs, which are selected and generated from the widely used LSOP benchmarks in the latest IEEE CEC 2013 large-scale global optimization (LSGO) competition test suite [42]. Furthermore, some state-of-the-art decomposition methods and algorithms, including the champion algorithm, are employed in the experimental comparisons to show the superiority of DDG. In addition, this article also conducts a case study on the large-scale parameter optimization of an NN-based three-category classification application, so as to further evaluate the real-world application potential of the proposed DDG.
The remainder of this article is organized as follows: Section II briefly introduces the background and related work, and Section III details the proposed methods and the time complexity of DDG. Experiments, including the settings, comparisons, and analyses, are provided in Section IV. Finally, Section V presents the conclusion.

II. BACKGROUND AND RELATED WORK

A. Separable Function and DG
The separable function is defined as follows.

Definition 1 [28]: A function f(x) is partially separable for minimization with k independent components if and only if

arg min_{x} f(x) = (arg min_{x_1} f(x_1, ***), . . . , arg min_{x_k} f(***, x_k))    (4)

where x_1, . . . , and x_k are k nonoverlapped subvectors of x, and the "***" in the parentheses can be any value in the corresponding search space. Note that according to (4), the optimal value of x_i (1 ≤ i ≤ k) should be the same no matter what the value of "***" is, and the combination of all optimal nonoverlapped subvectors should be the optimal solution to the original problem. Based on the above, if x_i and x_j (i ≠ j) are two nonoverlapped subvectors of x that satisfy (4), any variable in x_i and any variable in x_j are separable and do not interact with each other. In addition, note that the "arg min" should be "arg max" for maximization problems.
As a special type of partially separable function, the additively separable function is defined as follows.
Definition 2 [28]: A function f is partially additively separable if it has the following form:

f(x) = Σ_{i=1}^{k} f_i(x_i)    (5)

where x_1, x_2, . . . , and x_k are k nonoverlapped subvectors of x, functions f_1, f_2, . . . , and f_k are subfunctions of function f, and D is the total dimension of x. Specifically, the function f is also called fully additively separable if k equals D, while it is regarded as fully nonseparable if k = 1. From (5), we can see that a change of a subvector x_i will only change the value of f_i and, thus, the value of the original function f, but will not change the value of any other subfunction f_j (i ≠ j) because x_i is not a variable of f_j. Based on this, we have the following theorem.

Theorem 1 [27]: Given an additively separable function f(x) with the D-dimensional decision variable x = (x_1, x_2, . . . , x_D), for any real values a, b, c = a + δ (δ ≠ 0), and d (b ≠ d), if the following condition holds for two variables x_i and x_j:

f(x)|_{x_i=c, x_j=b} − f(x)|_{x_i=a, x_j=b} ≠ f(x)|_{x_i=c, x_j=d} − f(x)|_{x_i=a, x_j=d}    (6)

then x_i and x_j are additively nonseparable.

Based on this theorem, the idea of DG is to check whether (6) holds for every two variables to determine the additively separable structure of an objective function. Specifically, if (6) is satisfied, then x_i and x_j are additively nonseparable, which will be regarded as interacting variables and grouped together. Due to the calculation error and the computational precision of the computer system, the DG method checks (8) instead of (6) in practical implementations

|[f(x)|_{x_i=c, x_j=b} − f(x)|_{x_i=a, x_j=b}] − [f(x)|_{x_i=c, x_j=d} − f(x)|_{x_i=a, x_j=d}]| > ε_addi    (8)

where ε_addi is the acceptance threshold for detecting additively separable variables.
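For illustration, a minimal Python sketch of the DG-style check in (8) is given below; it is not the authors' implementation, and the base point and the sample values a, b, c, and d are chosen here only for the example.

```python
import numpy as np

def additively_interact(f, base, i, j, a, b, c, d, eps_addi=1e-3):
    """Check (8): flag x_i and x_j as interacting if the change of f caused
    by moving x_i from a to c depends on whether x_j equals b or d."""
    x1, x2, x3, x4 = (np.array(base, dtype=float) for _ in range(4))
    x1[[i, j]] = a, b
    x2[[i, j]] = c, b
    x3[[i, j]] = a, d
    x4[[i, j]] = c, d
    delta_addi = abs((f(x2) - f(x1)) - (f(x4) - f(x3)))
    return delta_addi > eps_addi

# The additive check wrongly flags the separable problem (1) as nonseparable.
g = lambda x: 2 * x[0] * x[1] + 5 * x[0] + 14 * x[1] + 35
print(additively_interact(g, base=[0.0, 0.0], i=0, j=1, a=-5, b=-2, c=5, d=0))  # True
```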

B. Related Work
In this section, we briefly review the related work on variable interaction learning within the EC community. As briefly mentioned in Section I, decomposition methods can be roughly divided into two categories: 1) static decomposition methods [34] and 2) dynamic decomposition methods [36]-[41]. In general, both static and dynamic decomposition methods have advantages and disadvantages and, therefore, they are suitable for different situations [20]. As the proposed DDG in this article is a static decomposition method, the following contents primarily describe the related work on static decomposition methods.
Usually, a good decomposition requires the appropriate separable structure of the objective function. To analyze the separable structure, detecting and learning the interactions among decision variables are essential. Therefore, many variable interaction learning methods have been proposed. Chen et al. [43] proposed a variable interaction learning strategy for problem decompositions. In this strategy, each variable is initially considered as a separate group, and then the groups that affect each other are merged, which finally obtains the groups that are independent of each other. Omidvar et al. [27] proposed the classical DG method, which was for detecting the interactions among variables in additively separable functions.
To date, DG has shown great efficiency in problem decomposition because it can capture the interactions among variables in additively separable functions. However, DG cannot detect the indirect interactions between variables. For this problem, Mei et al. [30] proposed global DG (GDG) to detect the indirect interactions between variables. GDG uses a matrix to denote the interactions among variables, where each element in the matrix represents the interaction degree between two variables. After obtaining the matrix by checking (8), GDG performs the breadth-first or depth-first technique to identify the direct and indirect interactions between variables. Similarly, Sun et al. [29] proposed extended DG (XDG), which iteratively detected the interactions between every two variables and accordingly divided the variables into several groups.
Although the above methods are powerful for detecting variable interactions, these methods require a large number of fitness evaluations (FEs), which can be an expensive cost in solving LSOPs. To address this issue, many studies have been proposed for detecting variable interactions using fewer FEs [28], [31]. For example, Hu et al. [34] proposed a fast interdependency identification mechanism to save many FEs. Furthermore, Omidvar et al. [28] proposed DG2, which reused some sample points to reduce the number of FEs needed by the original DG. In addition, Sun et al. [31] were inspired by binary search and proposed a recursive DG (RDG), where the detections of variable interactions were performed in a binary recursive manner. Moreover, the improved RDG method, called RDG3, has also been proposed and studied for overlapping functions [32].
In addition to the above DG-based methods, other decomposition methods have also been studied in recent years. For instance, Ge et al. [33] proposed a two-stage variable interaction method, where a learning model was first employed to explore some knowledge and then a marginalized denoising model was adopted to obtain the overall variable interactions based on the knowledge obtained in the first stage. Wang et al. [44] proposed a formula-based grouping that assumed the formula of the objective function was known before the optimization. Liu et al. [45] proposed a hybrid deep grouping method that not only considered variable interaction but also variable essentials, which was suitable for decomposing nonseparable problems.
In conclusion, a considerable number of the above methods are DG variants or are based on Theorem 1, which are only suitable for additively separable problems [46], [47]. Different from these methods, the DDG proposed in this article is useful for not only additively separable problems but also multiplicatively separable problems.

III. PROPOSED METHODS

A. Multiplicatively Separable Function
The definition of a multiplicatively separable function is given in this article as follows.
Definition 3: A function g is partially multiplicatively separable if it has the following form:

g(x) = Π_{i=1}^{k} g_i(x_i)    (9)

where x_1, x_2, . . . , and x_k are k nonoverlapped subvectors of x, functions g_1, g_2, . . . , and g_k are subfunctions of function g, and D is the total dimension of x. Specifically, function g is also considered fully multiplicatively separable if k equals D, while it is regarded as multiplicatively nonseparable if k = 1. Then, for the relationship between the additively separable function and the multiplicatively separable function, we have the following two theorems, that is, Theorem 2 and Theorem 3.
Theorem 2: Every additively separable function can be transformed into a multiplicatively separable function.
Proof: Given a partially additively separable function f(x) as defined in Definition 2, letting g(x) = e^{f(x)}, then g(x) can be rewritten as

g(x) = e^{f(x)} = e^{Σ_{i=1}^{k} f_i(x_i)} = Π_{i=1}^{k} e^{f_i(x_i)}    (10)

According to Definition 3 and (10), g(x) is multiplicatively separable and, therefore, the proof is finished.

Theorem 3: Every multiplicatively separable function whose minimum value is larger than 0 can be transformed into an additively separable function.
Proof: Given a multiplicatively separable function g(x) as defined in Definition 3 and with the minimum value larger than 0, let f(x) = ln g(x); then, f(x) can be rewritten as

f(x) = ln g(x) = ln Π_{i=1}^{k} g_i(x_i) = Σ_{i=1}^{k} ln g_i(x_i)    (11)

According to Definition 2, f(x) is additively separable and, therefore, the proof is finished.

That is, the logarithmic operation can transform a multiplicatively separable function into an additively separable function. It may be argued that some g_i(x_i) may be negative in (11). However, as g(x) is positive, the number of negative subfunctions [e.g., g_i(x_i)] must be even, and we can always negate such subfunctions in pairs without changing the product, that is, substitute each negative subfunction g_i(x_i) with

newg_i(x_i) = −g_i(x_i)    (12)

where newg_i(x_i) is guaranteed to be positive. In addition, to make Theorem 3 easier to understand and to illustrate what it can do, we take the problem in (1) again as an example here. By using the logarithmic operation as in (11), we can have

f_example(x_1, x_2) = ln g_example(x_1, x_2) = ln[(x_1 + 7)(2x_2 + 5)] = ln(x_1 + 7) + ln(2x_2 + 5)    (13)

where x_1 and x_2 belong to [−5, 5] and [−2, 2], respectively, as defined in (1). According to Definition 2, f_example is actually an additively separable function, which can be decomposed by DG-based methods. In other words, Theorem 3 shows that we can use a novel method modified/enhanced from the additive difference to detect the separable structure in multiplicatively separable functions. To be more specific, with Theorem 1 and Theorem 3, we can determine the multiplicatively separable variables by checking whether the following condition holds:

ln f(x)|_{x_i=c, x_j=b} − ln f(x)|_{x_i=a, x_j=b} ≠ ln f(x)|_{x_i=c, x_j=d} − ln f(x)|_{x_i=a, x_j=d}    (14)

where f(x) should be positive. In practical implementations, we can use (15) instead of (14)

|[ln f(x)|_{x_i=c, x_j=b} − ln f(x)|_{x_i=a, x_j=b}] − [ln f(x)|_{x_i=c, x_j=d} − ln f(x)|_{x_i=a, x_j=d}]| > ε_multi    (15)

where ε_multi is the acceptance threshold for detecting multiplicatively separable variables. Based on the above, we can propose the DDG method to obtain better decomposition for both additively and multiplicatively separable problems, which is described in the following contents.
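To make the dual check concrete, the following Python sketch (ours, under the assumption that the four sampled objective values are positive so that the logarithm is defined) applies both the additive difference of (8) and the log-transformed difference of (15) to the example problem in (1): the additive check wrongly reports interaction, while the multiplicative check correctly reports separability.

```python
import math

def dual_differences(f, x1, x2, x3, x4):
    """Return (delta_addi, delta_multi) for the four sample points
    x1 = (a, b), x2 = (c, b), x3 = (a, d), x4 = (c, d) in the (x_i, x_j) plane."""
    f1, f2, f3, f4 = f(x1), f(x2), f(x3), f(x4)
    delta_addi = abs((f2 - f1) - (f4 - f3))
    if min(f1, f2, f3, f4) <= 0:
        # ln is undefined; treat as "not multiplicatively separable".
        delta_multi = float("inf")
    else:
        delta_multi = abs((math.log(f2) - math.log(f1))
                          - (math.log(f4) - math.log(f3)))
    return delta_addi, delta_multi

g = lambda x: 2 * x[0] * x[1] + 5 * x[0] + 14 * x[1] + 35    # problem (1)
pts = [(-5.0, -2.0), (5.0, -2.0), (-5.0, 0.0), (5.0, 0.0)]   # (a,b), (c,b), (a,d), (c,d)
d_addi, d_multi = dual_differences(g, *pts)
print(d_addi)    # 40.0  -> additive check reports interaction
print(d_multi)   # ~0.0  -> multiplicative check reports separability
```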

B. DDG
As mentioned before, the idea of DDG is to detect whether variables are additively or multiplicatively separable and then select the best way to partition them into different groups accordingly. The pseudocode of DDG is presented as Algorithm 1. Note that the only difference between DDG and DG lies in lines 20-27, which aim to detect not only additively but also multiplicatively separable variables. That is, DG only uses Δ_addi > ε_addi, while DDG uses both Δ_addi > ε_addi and Δ_multi > ε_multi as conditions in the if-statement in line 25 of Algorithm 1. Therefore, the DDG is as easy to use as DG because it does not consume more FEs than DG and only adds a slight computational burden, as in line 21 of Algorithm 1.
In general, Algorithm 1 adopts a sequential fashion to detect the possible separable structure between each variable and the other variables. To be specific, Algorithm 1 will use a nested loop, that is, lines 7-35, to check whether each dimension variable is separable from other variables and then group the nonseparable variables together. The novel and key operations of Algorithm 1 lie in lines 20-27, which compute the dual differences for detecting additively or multiplicatively separable structures of variables.
For the additively separable structure, we can calculate the additive difference, that is, Δ_addi, to detect interactions. Let a, b, c, and d be real numbers within the search domain; in this article, a, b, c, and d are set as the lower bound of x_i, the lower bound of x_j, the upper bound of x_i, and the center of the search domain of x_j, respectively. It should be noted that these four values can also be set to other values, and the settings adopted in this article are just conventional choices in the literature. For simplicity, the solution x with x_i = a and x_j = b, that with x_i = c and x_j = b, that with x_i = a and x_j = d, and that with x_i = c and x_j = d are denoted as x1, x2, x3, and x4, respectively, in Algorithm 1. The additive difference is then computed as

Δ_addi = |[f(x2) − f(x1)] − [f(x4) − f(x3)]|    (16)
For the multiplicatively separable structure, as seen in line 21, the algorithm first uses the logarithmic function ln(·) to transform the original function into an additively separable one [similar to how (11) works] and then computes the multiplicative difference Δ_multi for interaction detection as

Δ_multi = |[ln f(x2) − ln f(x1)] − [ln f(x4) − ln f(x3)]|    (17)

It should be noted that if ln(f(x)) encounters calculation errors due to a nonpositive value of f(x), the algorithm will directly set Δ_multi to a value larger than ε_multi, as shown in line 23 of Algorithm 1. This ensures that the following procedure will not mistake the corresponding variables as multiplicatively separable due to the calculation error. After computing Δ_addi and Δ_multi, if both are larger than their corresponding thresholds ε_addi and ε_multi, respectively, then the variables x_i and x_j are neither additively nor multiplicatively separable. In this situation, x_i and x_j are considered nonseparable, and x_j will be grouped into the same group as x_i, as shown in line 26. Otherwise, x_i and x_j are separable and will not be grouped into the same group.
The detection operations, that is, lines 12-28, will be repeated until the interactions between x_i and all the remaining variables are checked. After detecting the interaction between x_i and all the remaining variables, DDG checks whether the temporary group of x_i (i.e., tempgroup) contains other variables. If not, then x_i is a fully separable variable, and its index will be stored in the fully separable group seps. Otherwise, the whole group tempgroup is stored as a new index set in allgroups. After this, DDG removes the indices of the detected variables (i.e., those in tempgroup) from the index set dims, and then selects an index from the remaining variables in dims as the next variable for checking its interactions with the other remaining variables in dims.
The above procedures will repeat until there is no variable index left in dims. Then, the seps, which includes the indices of all detected fully separable variables, will also be stored as an index set in allgroups. Finally, the DDG will output the allgroups, which contain the decomposition results that include both nonseparable and separable groups.
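The following condensed Python sketch summarizes the above grouping procedure; it is a simplified illustration rather than the exact Algorithm 1 (in particular, the choice of the base point and the reuse of function evaluations are simplified here), and the names dims, tempgroup, seps, and allgroups follow the text.

```python
import math

def ddg_group(f, lower, upper, eps_addi=1e-3, eps_multi=1e-8):
    D = len(lower)
    dims = list(range(D))
    allgroups, seps = [], []
    mid = [(l + u) / 2 for l, u in zip(lower, upper)]

    def value(i, j, vi, vj):
        x = list(lower)              # base point at the lower bounds (an assumption)
        x[i], x[j] = vi, vj
        return f(x)

    while dims:
        i = dims.pop(0)
        tempgroup = [i]
        for j in list(dims):
            a, c = lower[i], upper[i]            # two values for x_i
            b, d = lower[j], mid[j]              # two values for x_j
            f1, f2, f3, f4 = (value(i, j, *p)
                              for p in [(a, b), (c, b), (a, d), (c, d)])
            delta_addi = abs((f2 - f1) - (f4 - f3))
            if min(f1, f2, f3, f4) > 0:
                delta_multi = abs((math.log(f2) - math.log(f1))
                                  - (math.log(f4) - math.log(f3)))
            else:
                delta_multi = float("inf")       # ln undefined -> not mult. separable
            if delta_addi > eps_addi and delta_multi > eps_multi:
                tempgroup.append(j)              # x_j interacts with x_i
                dims.remove(j)
        if len(tempgroup) == 1:
            seps.append(i)                       # fully separable variable
        else:
            allgroups.append(tempgroup)
    if seps:
        allgroups.append(seps)
    return allgroups

# Problem (1): both variables are detected as separable and stored in seps.
g = lambda x: 2 * x[0] * x[1] + 5 * x[0] + 14 * x[1] + 35
print(ddg_group(g, lower=[-5, -2], upper=[5, 2]))   # [[0, 1]]  (the separable group)
```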

C. Complete DDG-Based CC Algorithm
As the decomposition method aims to divide the LSOP for better optimization, this part describes how the proposed DDG can be used in a CC framework for solving the LSOP. Fig. 2 presents the flowchart of the complete algorithm framework, and Algorithm 2 shows the pseudocode. Note that the main novelty of Algorithm 2 lies in that the decomposition stage uses DDG to decompose the problems, as shown in line 3. Algorithm 2 is developed by adopting the proposed DDG method in the classical CC framework [26], [27]. Although many CC frameworks have been proposed [48]-[50], the focus of this article is on the decomposition method rather than the CC framework. Therefore, without loss of generality, the most classical and widely used CC framework (i.e., the one used in [26], [27]) is adopted in this article as an example to develop the DDG-based CC algorithm. Similar to other existing decomposition-based algorithms [27]-[29], Algorithm 2 mainly has three procedures: the decomposition stage, the optimization stage, and the combination stage. The decomposition stage employs the proposed DDG to partition the problem, where the details can be found in Algorithm 1. In the optimization stage, different subpopulations are formed according to the results from the decomposition stage. Then, each subpopulation will be iteratively optimized for the corresponding subproblem in a round-robin fashion by an optimizer, where the optimizer can be any EC algorithm [50]-[53]. In this article, the self-adaptive differential evolution with neighborhood search (SaNSDE) [53], which is also the classical optimizer in decomposition-based studies [28], is adopted to work together with DDG to develop the complete CC algorithm. After the optimization stage, the combination stage forms the complete solutions with the updated subsolutions, as shown in line 18 of Algorithm 2. The above optimization and combination stages will repeat until the MaxFEs are totally consumed. Finally, Algorithm 2 outputs the complete solution and terminates.
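The following Python sketch outlines the above CC loop; it is only a schematic illustration (the function optimize_subcomponent is a placeholder standing for SaNSDE or any other EC algorithm, and details such as subpopulation maintenance are omitted).

```python
import numpy as np

def cc_optimize(f, lower, upper, groups, optimize_subcomponent, max_fes):
    """Decompose-then-optimize loop: `groups` comes from DDG (Algorithm 1);
    `optimize_subcomponent(obj, lb, ub)` is a placeholder subcomponent optimizer."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    context = lower + np.random.rand(len(lower)) * (upper - lower)  # complete solution
    used = [0]                                                      # FE counter

    def sub_objective(group):
        def obj(sub_x):
            trial = context.copy()
            trial[group] = sub_x        # plug the subsolution into the context vector
            used[0] += 1
            return f(trial)
        return obj

    while used[0] < max_fes:
        for group in groups:            # round-robin over the subcomponents
            best_sub = optimize_subcomponent(sub_objective(group),
                                             lower[group], upper[group])
            context[group] = best_sub   # combination stage: update the full solution
    return context, f(context)
```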

D. Analysis of Time Complexity
As the FE is the most time-consuming operation in decomposition methods, this part discusses the time complexity of DDG by analyzing the upper bound of the total number of FEs required by DDG. Without loss of generality, we assume that the problem is a D-dimensional problem with k separable subcomponents, where the subcomponents are nonoverlapped and each of them contains m = D/k variables. In addition, without loss of generality, we assume that variables x_{(i−1)·m+1} to x_{i·m} belong to the ith subcomponent (i.e., group). As seen in Algorithm 1, only four lines consume FEs, which are lines 15 and 18 in the innermost for loop, line 11 in the outer for loop, and line 6 outside the loops. Therefore, the total numbers of executions of these lines are analyzed one by one in the following contents.
First, we analyze the total number of executions of lines 15 and 18 in Algorithm 1, where each of them consumes one FE. Instead of directly counting these executions, we can count how many times the algorithm computes the differences (i.e., lines 19-24 in Algorithm 1) because each difference calculation corresponds to exactly one execution of both lines 15 and 18. For the sake of simplicity, we denote by S the upper bound on how many times the algorithm computes the differences. According to the above assumptions and Algorithm 1, after each loop for calculating the differences between a variable x_i and all the remaining variables in dims, m variables (including x_i) will be removed from dims. For example, for finding the first subcomponent of interacting variables, the algorithm will calculate the difference D − 1 times (between x_1 and the remaining D − 1 variables) and find that variables x_2 to x_m interact with x_1. Then, the m variables x_1, x_2, . . . , and x_m (i.e., the first one of the k nonoverlapped separable subcomponents) will be removed from dims. Similarly, for finding the second subcomponent, the algorithm will only calculate the difference D − m − 1 times (between x_{m+1} and the remaining D − m − 1 variables); and for the ith subcomponent, only D − (i − 1)·m − 1 times (between x_{(i−1)·m+1} and the remaining D − (i − 1)·m − 1 variables). As there are k subcomponents in total, the upper bound S can be calculated as

S = Σ_{i=1}^{k} [D − (i − 1)·m − 1] = kD − (k(k − 1)/2)·m − k    (18)

where D will not be less than m. Then, the total number of FEs for executing lines 15 and 18 is 2S. Second, we analyze the total number of executions of line 11 in Algorithm 1, which also requires one FE per execution. Every time line 11 is executed, Algorithm 1 will determine a subcomponent (refer to lines 12-34 in Algorithm 1). As there are k different subcomponents to be identified, line 11 is executed k times, and the total number of FEs required is also k.
Third, as line 6 of Algorithm 1 costs one FE and is outside the loops, line 6 is executed only once, which corresponds to one FE. Therefore, based on the above, the total number of FEs required by DDG is 2S + k + 1. That is, the time complexity of DDG with respect to the needed number of FEs is

O(2S + k + 1) = O(D²/m)    (19)

where m and k are not larger than D. Note that the time complexity of DDG, that is, (19), is the same as that of DG [27]. Therefore, the DDG is not time-expensive when compared to other DG-based methods.
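As a sanity check of this count, the following Python sketch (ours) simulates the grouping loop for an idealized problem with k equal-sized, nonoverlapped subcomponents and compares the simulated number of FEs with a simplified closed form of 2S + k + 1 (the simplification D²/m + D − D/m + 1, obtained by substituting m = D/k into (18), is our own derivation).

```python
def ddg_fes(D, k):
    """Simulate the FE count of the grouping loop for k nonoverlapped
    subcomponents of m = D / k variables each (variables 0..m-1 form group 0, etc.)."""
    m = D // k
    dims = list(range(D))
    fes = 1                                  # line 6: one FE outside the loops
    while dims:
        i = dims.pop(0)
        fes += 1                             # line 11: one FE per outer iteration
        fes += 2 * len(dims)                 # lines 15 and 18: two FEs per pairwise check
        dims = [j for j in dims if j // m != i // m]   # remove the detected subcomponent
    return fes

D, k = 1000, 20
m = D // k
print(ddg_fes(D, k))                         # 20981
print(D * D // m + D - D // m + 1)           # 20981, i.e., 2S + k + 1 with m = D/k
```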

IV. EXPERIMENTAL STUDIES

A. Benchmark Functions and Comparison Methods
To evaluate the proposed DDG, 30 LSOPs are adopted as the test suite in this article, all of which are minimization problems. Among these LSOPs, 15 are the original test functions from the widely used IEEE CEC 2013 benchmark set for the LSGO competition [42], including additively separable functions and fully nonseparable functions. In addition, since there are no multiplicatively separable functions in the original IEEE CEC 2013 benchmark set, 15 multiplicatively separable functions are constructed herein based on the functions in this benchmark set. Specifically, each constructed function multiplies two base functions f_a and f_b selected from the benchmark set, that is

F(x) = f_a(x_a) × f_b(x_b)    (20)

where x_a and x_b are the nonoverlapped decision variables of f_a and f_b, respectively. The adopted 30 test functions are shown in Table I. As the optimal values of f_a and f_b are both 0, the function in (20) can obtain the optimum value (i.e., 0) as long as one of f_a and f_b (or both) has been optimized to its optimal value 0. Therefore, if the test functions generated by (20) can be decomposed correctly, the optimization difficulties will greatly decrease. This is very suitable for evaluating the effectiveness of different decomposition methods on multiplicatively separable functions.
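The following Python sketch illustrates the construction in (20); the base function names and dimensions in the comment are only placeholders, not the exact setup of the test suite.

```python
def make_multiplicative(f_a, dim_a, f_b, dim_b):
    """Build a multiplicatively separable test problem as in (20):
    the first dim_a variables feed f_a, the next dim_b variables feed f_b."""
    def F(x):
        return f_a(x[:dim_a]) * f_b(x[dim_a:dim_a + dim_b])
    return F

# Example (assumed names/dimensions): a problem built from two base functions
# T16 = make_multiplicative(base_f01, 1000, base_f02, 1000)
```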
To further clarify the function characteristics in Table I, the symbols "A," "O," "N," and "M" are used to represent the function type as "additively separable," "overlap," "nonseparable," and "multiplicatively separable," respectively. As shown in Table I, the 30 test functions have various characteristics and, therefore, they can provide in-depth observations about how the proposed DDG may behave on different kinds of problems.
To compare the decomposition ability of DDG, some popular and state-of-the-art decomposition methods are adopted for comparisons of decomposition accuracy. These methods are DG [27], DG2 [28], XDG [29], and GDG [30], as briefly described in the related work in Section II-B. In addition, these decomposition methods are implemented based on their openly available source code and are adopted in Algorithm 2 to replace the DDG so as to develop their corresponding versions of the CC algorithm. That is, all different decomposition methods work with the same optimizer, so as to achieve a fair comparison.

B. Experimental Settings and Evaluation Metrics
In the experiment, the parameters of all decomposition methods are configured according to their original papers. In DDG, the value of ε_addi for detecting additively separable variables is configured as 10^−3, which is recommended in the literature for detecting additively separable problems [28]. For ε_multi, the value is set as ε_multi = 10^−8, where the corresponding parameter study will be given later in Section IV-H. Moreover, to obtain a fair optimization comparison, the CC algorithms using different decomposition methods adopt the same optimizer SaNSDE [28], as described in Section III-C. The population size of SaNSDE is set as 50, as suggested in [27] and [28]. Note that the population size in the original SaNSDE [53] is 100 and the population size may influence the algorithm performance. However, the influence of population size is not the focus of this article. Therefore, the population size is set as the frequently used value (i.e., 50) in the LSOP literature [27], [28]. In addition, the maximum number of available FEs, that is, MaxFEs, is 3 × 10^6 for all algorithms on each problem of T01-T15 according to the literature [42], and is 6 × 10^6 for all algorithms on each problem of T16-T30 because each of them is constructed from two problems in T01-T15.
To compare the decomposition ability of different decomposition methods, three evaluation metrics proposed in [30] are adopted in this article: the overall accuracy ρ_overall, the accuracy of separable variable detection ρ_sep, and the accuracy of interaction detection ρ_inter. These metrics are computed based on the interaction matrix Θ obtained by the decomposition method, where D is the problem dimension, (Θ)_{i,j} equals 1 if variable i and variable j are detected as interacting and 0 otherwise, Θ_ideal is the ideal interaction matrix for a problem, and the operator "◦" is the entrywise product of two matrices. Based on their definitions, ρ_overall measures the overall accuracy of the decomposition method, ρ_sep only measures the accuracy of separable variable detection, and ρ_inter only measures the accuracy of interaction detection [30]. Note that T01-T15 are from existing benchmark problems and have their corresponding ideal interaction matrices [28], denoted as Θ_ideal1, . . . , and Θ_ideal15, respectively. Based on this, the interaction matrices of T16-T30 can be constituted by Θ_ideal1, . . . , and Θ_ideal15. For example, T16 is the multiplication of T01 and T02 and, therefore, its interaction matrix Θ_ideal16 is the block-diagonal matrix formed by Θ_ideal1 and Θ_ideal2. To reduce the statistical error, 25 independent runs of each CC algorithm are carried out on each problem, and the average results are used for comparison. In addition, the Wilcoxon rank-sum test with a significance level α = 0.05 is adopted to statistically compare the optimization results, where the symbols "+," "≈," and "−" are used to show that the proposed algorithm is significantly better than, similar to, or significantly worse than the compared algorithm, respectively.
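For reference, the following Python sketch computes the three metrics from the binary interaction matrices; since the exact formulas are given in [30], the definitions used here (fractions of correctly detected interacting pairs, separable pairs, and all pairs) are our assumptions of the natural interpretation implied by the text.

```python
import numpy as np

def decomposition_accuracy(theta, theta_ideal):
    """theta, theta_ideal: (D, D) 0/1 interaction matrices; diagonal excluded."""
    D = theta.shape[0]
    off = ~np.eye(D, dtype=bool)
    inter_mask = (theta_ideal == 1) & off          # truly interacting pairs
    sep_mask = (theta_ideal == 0) & off            # truly separable pairs
    rho_inter = (theta[inter_mask] == 1).mean() if inter_mask.any() else 1.0
    rho_sep = (theta[sep_mask] == 0).mean() if sep_mask.any() else 1.0
    rho_overall = (theta[off] == theta_ideal[off]).mean()
    return rho_overall, rho_sep, rho_inter

# For T16 = T01 x T02, the ideal matrix is block diagonal, e.g.:
# theta_ideal16 = np.block([[theta_ideal1, np.zeros((D1, D2))],
#                           [np.zeros((D2, D1)), theta_ideal2]])
```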

C. Comparisons on Decomposition Accuracy
The decomposition accuracy of DDG and the other decomposition methods is compared in terms of the three evaluation metrics. The results are provided in Table II for the first metric and Table S.I in the supplementary material for the second and third metrics. The best results are marked in boldface. Moreover, to give a clearer understanding of DDG, the detailed grouping results of DDG on T04, T13, T19, and T29 are shown in Tables S.II-S.V in the supplementary material, which are, respectively, a representative additively separable function, an overlapping function, a multiplicatively separable function generated from a fully separable function and an overlapping function, and a multiplicatively separable function generated from an overlapping function and a nonseparable function. First, the decomposition results show that the DDG is competitive with other decomposition methods on additively separable problems. As shown in Table II, the DDG, in terms of the overall decomposition accuracy, can generate results that are competitive with DG on additively separable functions and correctly decompose the three fully additively separable functions (i.e., T01-T03). This verifies the ability of DDG to decompose additively separable problems.
Second, the results in Table II show that the DDG performs significantly better than other decomposition methods on multiplicatively separable problems. It can be seen that on multiplicatively separable test functions (e.g., T16-T27), the decomposition accuracy of DDG is more promising than that obtained by other decomposition methods. Moreover, DDG achieves significantly better decomposition accuracy than both DG2 and XDG on all the 15 generated multiplicatively separable functions, that is, T16-T30. In addition, on T16-T27, the DDG performs significantly better than DG and GDG on 11 and 11 problems, respectively. For T28-T30, the decomposition accuracy of DDG is not as good as that of GDG. This may be because T28-T30 are all constructed from two nonseparable or overlapping problems. In such situations, although the problem can be decomposed correctly into two nonseparable or overlapping problems by checking the multiplicative difference, the additive difference cannot work well to decompose the nonseparable or overlapping problems correctly; for example, the original DG also works poorly on T28-T30. However, when compared to DG, the DDG actually works better on T28-T30, which suggests the effectiveness of the multiplicative difference in DDG.
Third, Table S.I in the supplementary material shows that the main advantage of DDG is the detection of multiplicatively separable variables. As shown on the left side of Table S.I in the supplementary material, the DDG obtains 100% accuracy on detecting the separable variables in T16-T27, while DG, DG2, and XDG have much lower accuracy. For the GDG, although it can also obtain high accuracy on detecting the separable variables of T16-T30, similar to DDG, it only has low decomposition accuracy for the interacting variables (as shown on the right side of Table S.I in the supplementary material), and its overall detection accuracy therefore decreases. In addition, Table S.I in the supplementary material shows that on the multiplicatively separable problems T16-T30, the XDG has high accuracy on interacting variables but very poor accuracy on separable variables. This may be because the XDG fails to detect the separable structures in T16-T30 and mistakes many variables as interacting, resulting in high accuracy on the interacting variables but nearly zero accuracy on the separable variables. However, the DDG proposed in this article does not have this problem because it can detect the separable structure in not only additively separable problems but also multiplicatively separable problems.
In conclusion, the comparisons of decomposition results have shown the great effectiveness of DDG on both additively and multiplicatively separable problems.

D. Comparisons on Optimization Results
To investigate the advantage of the DDG-based CC algorithm in optimizing LSOPs, this section compares the optimization results. The comparisons are made among the CC algorithms that use different decomposition methods, including DG, DG2, XDG, and GDG. Also, the random decomposition (RD) method is adopted to evaluate the effectiveness of DDG, where the RD groups variables randomly into five groups at the beginning of each run. For simplicity, the CC algorithm with the decomposition method X is denoted as "CC-X" in the following contents.
The comparison results are provided in Table III and the detailed results are given in Table S.VI of the supplementary material. According to the Wilcoxon rank-sum test, CC-DDG significantly outperforms CC-DG, CC-DG2, CC-XDG, CC-GDG, and CC-RD on 15, 16, 16, 16, and 15 problems, respectively. Furthermore, CC-DDG can obtain the best results on most multiplicatively separable problems (i.e., T21-T26 and T28-T30), showing its strong effectiveness in solving large-scale multiplicatively separable problems. In addition, on additively separable functions (T01-T11) and nonseparable functions (T12-T15), CC-DDG can also perform similarly to CC-DG and obtain the best results on T03, T06, and T10 among the six CC algorithms with different decomposition methods, which suggests that CC-DDG can also have competitive performance on additively separable LSOPs. In conclusion, CC-DDG is effective for optimizing both additively and multiplicatively separable LSOPs.

E. Comparisons With the Champion Algorithm
This part compares the DDG-based algorithm with the champion algorithm of the IEEE CEC 2019 LSGO competition, that is, the RDG3-based algorithm [32]. As the champion algorithm adopts the contribution-based CC framework (CBCC) as its optimization framework [32] and the covariance matrix adaptation evolution strategy (CMA-ES) [54] as its optimizer, the DDG-based algorithm is also integrated with CBCC and CMA-ES for a fair comparison. Moreover, as the RDG3 is an enhanced extension of DG (i.e., the third version of recursive DG) for solving overlapping problems, we also extend the DDG with the same recursive strategy to obtain the RDDG3 for a fair comparison. The extension is easy to implement by replacing DG with DDG, where the modification of RDDG3 over RDG3 is given in Algorithm S.1 (lines 9-15) of the supplementary material. This way, the RDDG3 is expected to be suitable for additively separable, multiplicatively separable, and overlapping problems. The three algorithms are called CBCC-RDDG3, CBCC-DDG, and CBCC-RDG3, where the settings of CBCC and CMA-ES in the three algorithms are configured identically according to the original paper of CBCC-RDG3 [32].
The comparison results of grouping accuracy among RDDG3, DDG, and RDG3 are provided on the left side of Table IV, where the corresponding detailed results can be seen in Table S.VII of the supplementary material. As shown in Table IV, the RDDG3 and DDG obtain better grouping accuracy than the RDG3, especially on the 15 multiplicatively separable problems (i.e., T16-T30). To be specific, on the problems T01-T15, the RDDG3, DDG, and RDG3 obtain the best grouping results on 8, 11, and 6 problems, respectively. On the 15 multiplicatively separable problems T16-T30, the RDDG3 and DDG obtain the best grouping results on nine and ten problems, respectively, while the RDG3 has the best results on none of these problems. Therefore, the grouping results have shown the effectiveness of the RDDG3 and DDG on problem decomposition.
As for the optimization results, the comparison results of CBCC-RDDG3, CBCC-DDG, and CBCC-RDG3 are given on the right side of Table IV, where the corresponding detailed results can be seen in Table S.VII of the supplementary material. The results show that, when combining the advantages of RDG3 and DDG together, the CBCC-RDDG3 can outperform the CBCC-RDG3 not only on the 15 test functions T01-T15 in the original IEEE CEC 2013 LSGO benchmark, but also on the 15 generated multiplicatively separable problems T16-T30. Specifically, according to the Wilcoxon rank-sum test, CBCC-RDDG3 performs significantly better than CBCC-RDG3 on 6 and 8 problems of T01-T15 and T16-T30, respectively. This means that the CBCC-RDDG3 has better overall performance than CBCC-RDG3 on both the additively and multiplicatively separable problems. Although CBCC-DDG may not be superior to CBCC-RDG3 on some overlapping problems, it performs better than CBCC-RDG3 on 8 out of the 15 multiplicatively separable problems (i.e., T16-T21, T24, and T29, more than half). Moreover, on the 12 overlapping problems indicated in Table I, CBCC-RDDG3 significantly outperforms CBCC-RDG3 on seven of them (i.e., T12-T14, T20, T22, T25, and T28, more than half). These results suggest that the DDG variants (especially the resulting RDDG3) are promising for more kinds of problems, including additively separable, multiplicatively separable, and overlapping problems. Moreover, as the RDDG3 is the recursive version of the DDG method and the RDG3 is the recursive version of the DG method, the superiority of RDDG3 over RDG3 further indicates that the DDG (i.e., considering not only additively separable problems but also multiplicatively separable problems) is more promising than the DG (i.e., considering only additively separable problems), which is a major motivation and contribution of the proposed DDG.

F. Comparisons With the State-of-the-Art Nondecomposition Algorithms
Besides the above comparisons with the decomposition-based algorithms, this part further compares the DDG-based CC algorithm with some state-of-the-art nondecomposition algorithms. In particular, three well-known algorithms in the literature, that is: 1) SHADE-ILS [55]; 2) MLSHADE-SPA [56]; and 3) MOS [57], are adopted in the comparison. In the experiment, the SHADE-ILS is also adopted as the optimizer in CC-DDG, so that we can see the advantage of DDG more clearly from the comparisons. Moreover, other decomposition-based CC algorithms with SHADE-ILS as the optimizer are also adopted in the comparison to investigate the effectiveness of CC-DDG.
The comparisons of statistical results are given in Table V, where the detailed results are provided in Table S.VIII of the supplementary material. As shown in Table V, the CC-DDG has better overall performance than the compared algorithms, including both decomposition-based and nondecomposition algorithms. Specifically, when compared to the state-of-the-art nondecomposition algorithms, CC-DDG performs significantly better than SHADE-ILS, MLSHADE-SPA, and MOS on 15, 19, and 20 problems, similar on 7, 4, and 3 problems, while significantly worse only on 8, 7, and 7 problems, respectively. That is, the comparisons have shown the effectiveness of CC-DDG and its potential to integrate with state-of-the-art nondecomposition algorithms.

G. Component Analysis of DDG
To investigate the component contribution, the DDG is compared with its variants that do not use the additive difference Δ_addi or the multiplicative difference Δ_multi for detecting separable structures. For simplicity, these two variants are denoted as DDG-w/o-addi and DDG-w/o-multi, respectively. Note that the DDG-w/o-multi is the same as DG, as it does not detect the multiplicatively separable structure and only detects the additively separable structure. The comparison results of the variants are provided in Table S.IX of the supplementary material.
First, in terms of the overall accuracy, as shown in the left three columns of Table S.IX in the supplementary material, the DDG obtains the best results on more problems than both the DDG-w/o-addi and DDG-w/o-multi. Specifically, among the 30 tested problems, DDG produces the best results on 22 problems, while DDG-w/o-addi and DDG-w/o-multi only obtain the best results on 16 and 9 problems, respectively. More importantly, among the three methods, the DDG-w/o-addi performs worst on additively separable problems (e.g., T01-T11), while DDG-w/o-multi performs worst on multiplicatively separable problems (e.g., T16-T30), which suggests the significance of both the additive difference and the multiplicative difference for detecting separable structures.
Second, considering the accuracy of separable variable detection, as shown in the middle three columns of Table S.IX in the supplementary material, the contributions of additive difference and multiplicative difference are more obvious. As seen, the DDG-w/o-addi method has very poor accuracy in finding the separable variables in the additively separable problem, such as the fully separable problems T02 and T03. In addition, the DDG-w/o-multi method works very badly on detecting the separable variables in the multiplicatively separable problem, such as T25 and T27. Therefore, the decomposition ability of DDG will deteriorate if the additive difference or multiplicative difference is not used.
Third, the accuracy of interaction detection, together with the overall accuracy, further reveals the poor decomposition ability of DDG-w/o-multi, showing the effectiveness of Δ_multi. For example, although DDG-w/o-multi can detect the interaction structure in T27, it fails to detect the separable variables in T27, especially the multiplicatively separable variables, and mistakenly characterizes them as interacting. As a result, most elements in the interaction matrix Θ obtained by DDG-w/o-multi for T27 are 1. This leads to a high value of ρ_inter (i.e., accuracy only on interacting variables) but very small values of ρ_sep (i.e., on separable variables) and ρ_overall (i.e., on overall performance). In other words, in terms of the detection accuracy on separable variables and on all variables, DDG-w/o-multi performs very poorly because it cannot detect the multiplicatively separable variables correctly. Similar results can also be seen on other multiplicatively separable problems, e.g., T20-T26, T28, and T29. These results further verify the great effectiveness and contribution of Δ_multi for decomposing multiplicatively separable problems.
From the above, it can be concluded that both the additive and multiplicative differences have their own contributions to the effectiveness of DDG, and removing any of them will decrease the decomposition performance of DDG.

H. Influences of the Threshold for Multiplicatively Separable Detection
This part studies the influence of the threshold value for detecting multiplicatively separable variables. For this aim, we compare the DDG that uses the original setting (i.e., ε_multi = 10^−8) with its variants using ε_multi = 10^−4, ε_multi = 10^−12, and ε_multi = 10^−16, which are denoted as DDG-8 (the original DDG), DDG-4, DDG-12, and DDG-16, respectively. The decomposition results are provided in Table S.X of the supplementary material.
First, Table S.X in the supplementary material shows that the threshold ε_multi for detecting multiplicatively separable variables does not have a significant influence on the decomposition accuracy for additively separable problems. It can be seen that on additively separable functions, for example, T01-T11, the DDG with different threshold settings obtains similar decomposition accuracy. In other words, the setting of ε_multi will not affect the decomposition ability of DDG on additively separable problems and, therefore, the DDG with a proper ε_multi can be suitable for both additively separable problems and multiplicatively separable problems.
Second, Table S.X in the supplementary material shows that different multiplicatively separable problems favor different ε_multi. As shown in Table S.X of the supplementary material, ε_multi = 10^−8 achieves higher accuracy on T21 and T24 than the other settings, ε_multi = 10^−12 obtains the best results on T25-T27, ε_multi = 10^−16 outperforms the other settings on T19, T20, T22, and T23, and the best results on T28-T30 are produced by ε_multi = 10^−4. Generally, when the test functions are constructed from functions f1-f3, smaller threshold settings (i.e., ε_multi = 10^−8, ε_multi = 10^−12, and ε_multi = 10^−16) can have higher decomposition accuracy than the larger setting ε_multi = 10^−4, while for problems constructed from T13-T15, larger threshold settings (e.g., ε_multi = 10^−4) can produce better results than the other settings. This may be because different problems require different threshold values to determine whether the variables are multiplicatively separable. In addition, although ε_multi = 10^−4 obtains most of the best results among the four settings, it has a much worse result on T15 than the other settings, which means that its performance is sensitive to the problem characteristics and it may have poor generalization ability. Therefore, it should not be considered a good setting in this article. Instead, ε_multi = 10^−8 balances the decomposition accuracy on different problems and is recommended in this article.
In conclusion, the threshold value ε_multi does not have a significant influence on the decomposition ability of DDG on additively separable problems, but a good setting of ε_multi can further improve the ability of DDG to decompose multiplicatively separable problems.

I. Case Study on the Parameter Optimization for Neural Network-Based Application
To further study the effectiveness of the proposed DDG, this part conducts a case study on an NN parameter optimization application. NNs are efficient for classification problems and have been widely used in many real-world applications [40], [41]. However, the parameters (e.g., weights and biases) of the NN are essential to its performance, which requires efficient optimization. Moreover, the NN often contains a great number of parameters (often more than 1000) to be optimized. Therefore, the parameter optimization of NNs is a typical LSOP, which is suitable for evaluating the application ability of DDG.
In this article, we consider an NN-based classification system for the wine classification task, where the public wine dataset is collected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php). The dataset contains 178 samples, and each sample has 13 features and belongs to one of three categories. For such a classification problem, a widely used three-layer NN model is adopted in this article, which includes one input layer, one hidden layer, and one output layer, as shown in Fig. 3. The input layer has 13 neurons to receive the 13 features. The hidden layer has 60 neurons, with the activation function of each hidden neuron being the widely used sigmoid function. Moreover, three-dimensional one-hot encoding is used to encode the three categories and, therefore, there are three neurons in the output layer. For each input sample X_i, the target output Y_i is a 3-D vector with Y_{i,j} = 1 if X_i belongs to the jth category and Y_{i,j} = 0 otherwise (j = 1, 2, 3). Therefore, given NS samples and their corresponding target outputs, the NN parameter optimization problem can be defined as a minimization problem with the objective function F(θ) as

min_{θ} F(θ) = Σ_{i=1}^{NS} ||NN(θ, X_i) − Y_i||²

where θ represents the parameters of the NN, that is, the variables to be optimized, NN(θ, X_i) is the output of the NN, ||NN(θ, X_i) − Y_i||² computes the square sum of the 3-D difference vector between NN(θ, X_i) and Y_i, and a smaller F(θ) is better. Based on the above, the number of parameters between the input and hidden layers is 60 × (13 + 1) = 840 (each of the 60 hidden neurons requires 13 weights for the 13 feature inputs and one bias), and the number of parameters between the hidden and output layers is 3 × (60 + 1) = 183 (each of the 3 output neurons requires 60 weights for the 60 hidden neurons and one bias). Therefore, the total number of parameters for optimization is 840 + 183 = 1023, that is, the parameter optimization problem has 1023 variables. In addition, for a fair comparison, all decomposition methods are integrated with the CC framework and have 3 × 10^6 FEs in total for every independent run (including the FEs consumed for decomposition). Besides, the first 80% of the samples are used as training data while the last 20% are treated as test data, and the search range of each parameter is [−1, 1].

Table VI gives the comparison results of different decomposition-based CC algorithms over 25 runs. As can be seen, the CC-DDG obtains the best training loss and accuracy among the seven algorithms. Moreover, it is interesting that only DDG can decompose the problem into different groups, while all other methods (except the RD) fail to decompose the problem but still consume considerable unnecessary FEs, which suggests that this problem is not additively separable but multiplicatively separable. Therefore, our proposed DDG method is promising. For better visualization, Fig. 4 provides the grouping map of the 1023 variables based on the DDG results, where the color of each pixel (i, j) means the ith and jth variables are in the same group of the corresponding color, except that white means the ith and jth variables are not in the same group. As shown in Fig. 4, the groups have a regular pattern and similar sizes after the decomposition by DDG. Therefore, the superiority of CC-DDG may be because the DDG can decompose the problem appropriately with a small number of FEs and then the problem can be optimized more efficiently with more remaining FEs.
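For concreteness, the following Python sketch (ours, not the code used in the experiments; the output layer is assumed to be linear since its activation is not specified above) shows how the 1023 parameters can be packed into a single vector θ and how F(θ) is evaluated.

```python
import numpy as np

N_IN, N_HID, N_OUT = 13, 60, 3
DIM = N_HID * (N_IN + 1) + N_OUT * (N_HID + 1)     # 840 + 183 = 1023 parameters

def nn_forward(theta, X):
    """theta: (1023,) parameter vector; X: (NS, 13) inputs; returns (NS, 3) outputs."""
    w1 = theta[:N_HID * N_IN].reshape(N_HID, N_IN)               # 780 input-hidden weights
    b1 = theta[N_HID * N_IN:N_HID * (N_IN + 1)]                  # 60 hidden biases
    off = N_HID * (N_IN + 1)
    w2 = theta[off:off + N_OUT * N_HID].reshape(N_OUT, N_HID)    # 180 hidden-output weights
    b2 = theta[off + N_OUT * N_HID:]                             # 3 output biases
    hidden = 1.0 / (1.0 + np.exp(-(X @ w1.T + b1)))              # sigmoid hidden layer
    return hidden @ w2.T + b2                                    # linear output (assumed)

def objective(theta, X, Y):
    """F(theta): sum of squared errors between the outputs and the one-hot targets Y."""
    return float(np.sum((nn_forward(theta, X) - Y) ** 2))
```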
Based on the above, the case study has verified the effectiveness of the proposed DDG.

V. CONCLUSION
In this article, we attempted to obtain a more general decomposition method to decompose and optimize more kinds of LSOPs correctly and efficiently. First, we mathematically defined the multiplicatively separable function and showed its relationship with the additively separable function. Then, we proposed and proved two related theorems that can guide the detection of the separable structure in multiplicatively separable problems. Based on the above, the novel DDG method was proposed to achieve a more general decomposition ability for LSOPs. The DDG utilizes two kinds of differences to detect the separable structure of both additively and multiplicatively separable problems. Moreover, a DDG-based CC algorithm framework was developed for solving LSOPs by combining the DDG with a classical and widely used CC framework. In addition, the time complexity of DDG was also analyzed with respect to the number of FEs. Extensive experiments were conducted on 30 LSOPs and a case study on parameter optimization for an NN-based application, where state-of-the-art methods were adopted for comparisons. The experimental results have verified the effectiveness and efficiency of the DDG and the DDG-based algorithm.
For future work, the idea of DDG will be further studied to reduce the FE cost, improve the detection ability, and address the disadvantage in handling negative objective functions. Furthermore, the DDG will be further studied and extended to solve more kinds of separable problems, such as multiplicatively separable functions generated by different types of functions (e.g., T04-T11). Also, the combination of different decomposition methods and optimizers is worthy of research to develop more powerful algorithms. Besides, DDG-based algorithms with distributed computation methods [58]-[60] are worth studying and applying to challenging real-world LSOPs and big data applications.