Explicitly Constrained Black-Box Optimization With Disconnected Feasible Domains Using Deep Generative Models

We tackle explicitly constrained black-box continuous optimization problems in which the feasible domain forms a union of disconnected feasible subdomains. The decoder-based constraint-handling technique is a promising approach when the feasible domain is disconnected. However, the design of a reasonable decoder requires deep prior knowledge of the optimization problem to be solved and, hence, human effort. In this study, we investigated the usefulness of a deep neural network as a decoder and developed a training scheme for a deep neural network without prior information, such as a training dataset consisting of feasible and infeasible solutions required by existing decoder approaches. To stabilize the training of the deep generative model as the decoder, we propose decomposing the decoder into sub-models, introducing skip connections to each sub-model, and training the sub-models sequentially with separate loss functions. Numerical experiments using a test problem and a topology optimization problem show that the proposed method can find feasible domains with better objective function values and higher probability than both conventional decoder-based constraint-handling methods and non-decoder-based constraint-handling methods.


I. INTRODUCTION
Black-Box Optimization (BBO) is a class of mathematical optimization problems in which the objective function is a black-box function. When a simulation is required to evaluate the objective function value of a solution, such an optimization problem is often treated as BBO because the relation between a solution and its objective function value is nearly a black-box. BBO appears extensively in engineering fields such as material engineering [51], aeronautical engineering [5], [42], ocean engineering [41], [46], civil engineering [11], [45], mechanical engineering [3], [17], [18], [19], [20], and artificial intelligence [13], [23].
The associate editor coordinating the review of this manuscript and approving it for publication was Huiyan Zhang .
In this study, we consider the following black-box continuous optimization with explicit constraints: where f , g 1 , . . . , g m : S → R are the objective function and m constraint functions, respectively, and S ⊆ R n is the search domain on which f is well defined. We assume that S is a rectangular region (e.g., S = [−1, 1] n ). In BBO, the objective function usually requires computationally expensive simulations, and the gradients of the objective function, ∇f , are unavailable or unreliable. Constraint functions can be categorized as either explicit or non-explicit constraints.
In this study, we focus on explicit constraints, where the constraint functions are computationally cheap to evaluate, and their gradients, ∇g j , are often available.
This study focuses on explicitly constrained BBO problems where the feasible domain X := {x ∈ S | g j (x) 0, ∀j} is the union of non-connected feasible subdomains X l (l=1,...,M ) , X i ∩X j = ∅ (i = j), that is, X := L l=1 X l . Such a problem is difficult to address because it is challenging to search across different disconnected domains.
This work originated in the field of topology optimization (TO) [8], [9], [54], which is an optimization that determines the optimal distribution of materials in a given design space. Several methods have been proposed, such as levelset methods [54], density-based methods [8], and NGnet methods [48]. In TO, a material configuration is parameterized using a real vector x, and an objective function encompasses a boundary value problem of partial differential equations, which are computed using numerical solution methods such as the finite element method. Therefore, the evaluation of the objective function of TO is relatively computationally expensive, and the relationship between the solutions and objective function values is almost a black-box. Restrictions, such as volume and perimeter constraints [19], are often imposed to prevent undesirable material arrangements, such as checkerboard arrays [14]. The computational cost of restrictions is significantly lower than that of the objective function, and because the relationship between the constraint function values and solution is explicitly available, it is often possible to compute their gradients. However, the resulting feasible domain can be separated into several disconnected subdomains. Therefore, this problem can be considered an example of the aforementioned BBO problems.
Evolutionary computation has been widely used as a promising optimization method in BBO. This is because evolutionary computation does not require gradients or a priori knowledge of the characteristics of the objective function, such as smoothness, and has been empirically shown to have a low initial value dependence, which are properties required for BBO. In particular, the covariance matrix adaptation evolution strategy (CMA-ES) [2], [25], [29], [30], [32], a type of evolutionary computation, is a quasiparameter-free method, and has been successfully applied to several real-world BBOs [10], [16], [36], [40], [41], [52], [53], including various TO applications [17], [18], [19], [20]. In this study, we also investigate optimization based on CMA-ES.
Decoder-based techniques [12], [35] are considered promising approaches for BBO problems with disconnected feasible subdomains [43]. The idea underlying decoder-based approaches involves designing a map G : Z → S, called the decoder, which is a mapping from a relatively simple and possibly convex region Z, such as a hypercube, to the search domain S. In this work, we consider setting Z = S ⊆ R n . Ideally, the image of G should be in the feasible domain, that is, G(Z) = X. Then, the original problem (Equation (1)) is translated into the following problem: argmin z∈Z f (G(z)) s.t. g j (G(z)) 0, ∀j = 1, . . . , m. (2) Once a reasonable decoder is obtained, the solver only needs to deal with a relatively simple boundary of Z. However, existing works in decoder-based approaches [12], [35] require prior knowledge about the feasible domain or a dataset of feasible points to design or train a decoder.
The main contribution of this study is the automation of the decoder design while inheriting the decoder feature that transforms the original optimization problem with a disconnected feasible domain X (Equation (1)) into an optimization problem with a convex domain Z (Equation (2)), whose boundary is easy to handle. In practice, owing to the continuity and incomplete training of neural networks, G(Z) ⊆ X cannot be guaranteed, and the constraints of the transformed optimization problem (Equation (2)) are violated in some regions of Z. However, the fraction of the Lebesgue measure of the feasible domain on Z (regions of z ∈ Z, such that G(z) ∈ X) is much larger than the fraction of the Lebesgue measure of the feasible domain X on the search domain S, as shown in the numerical experiments in Section IV. When the proportion of infeasible domains is small, a stochastic multipoint search method, such as CMA-ES, can ignore the infeasible domains because the probability of generating infeasible solutions is sufficiently low. This is expected to make it easier to identify feasible domains with globally superior objective function values.
The remainder of this paper is organized as follows: Section II reviews decoder-based constraint-handling techniques; Section III introduces the proposed method; Section IV quantitatively analyzes the proposed decoder G using a test problem in which the number of disconnected regions in the feasible domain can be adjusted; Section V verifies the effectiveness of the proposed method using a topology optimization problem by comparing it with existing decoder methods and typical constraint-handling methods; and finally, Section VI concludes the paper.

II. REVIEW OF DECODER-BASED CONSTRAINT-HANDLING
Many evolutionary computation methods, including CMA-ES, were designed by assuming unconstrained optimization or optimization with only rectangular constraints. Therefore, when applying them to constrained optimization, it is necessary to use them in conjunction with constrainthandling methods. The penalty function method is widely used in engineering as a constraint-handling method owing to its versatility and ease of implementation. However, the optimization becomes inefficient if the penalty coefficients are not well adjusted; if the penalty is too small, the constraints will be violated, and if the penalty is too large, the landscape of the penalized objective function will have a deceptive structure (i.e., a globally weak structure [27]), as shown in Figure 1. It is empirically known that CMA-ES can appropriately search on globally well-structured functions (i.e., functions with a big valley structure); however, it often fails to locate good local optimal solutions on globally weakly structured functions (i.e., functions with a deceptive structure) [28], [39]. Therefore, such a constraint-handling method is not suitable for BBO problems with disconnected feasible subdomains.
The advantages of using a decoder include the following factors. First, we believe that optimizing the transformed problem in Equation (2) is expected to be much easier than optimizing the original problem Equation (1); if we have a decoder satisfying G(Z) ≈ X, then the optimization method only needs to consider the simple boundary of the convex domain Z, and effective constraint-handling methods have been proposed for such constraints (e.g., [47]). Second, we can utilize the fact that the computational cost of the constraint function g j is significantly lower than that of the objective function f . In the case of the penalty function method, it is necessary to evaluate f , which is expensive even when the constraint violations are large and the objective function values are not necessary to guide the search method. By contrast, learning a decoder requires considerable access to constraint functions, but the cost is relatively small. Searching for the function defined in Equation (2) using the learned decoder on Z corresponds to searching only in the feasible domain X; thus, the number of f -calls can be reduced. Therefore, when the measure of the feasible domain X is very small compared with the measure of the search domain S, it is expected that optimization can be achieved in less time.
In [35], a method was proposed to create a quasi-isomorphic map between X and an n-dimensional hypercube Z as the decoder. This mapping was defined non-explicitly by a line search between the generated candidate solution z and a predefined reference point (a feasible solution). The disadvantage of this method was that the decoder tended to project an infeasible solution to the feasible domain in which the reference point was located. Therefore, domain knowledge of the problem and trial-and-error are required to determine a reasonable reference point. In [12], a method using a support vector data description (SVDD) was proposed to model and train a decoder using a dataset consisting of feasible and infeasible solutions. This method alleviated the tendency to select a single feasible domain, but it required the preparation of the training dataset in advance. The SVDD-based decoder method required expert knowledge of the problem and human effort to design and train the decoder. If we can obtain a decoder without domain knowledge or human effort, we can automatically transform a difficult problem (Equation (1)) into a relatively simple problem (Equation (2)); however, to the best of our knowledge, such a method has not yet been explored.
The purpose of this work is to automate the design of the decoder G : Z → X and facilitate the handling of explicit constraints in BBO, where the feasible domain is the union of several disconnected domains. As the feasible domain X is disconnected, the decoder must be able to represent highly nonlinear maps. Recent developments in deep generative models, such as generative adversarial networks (GANs) [21], variable auto-encoders (VAEs) [34], and flow-based deep generative models (flows) [15], have shown promising performance in terms of representing highly nonlinear mappings. However, unlike machine learning tasks, in which deep generative models are usually applied, the difficulty in using deep generative models in this research is that we do not have prior training data at hand. To address this difficulty, we propose a new training method that does not require the prior collection of training data.
In this study, we propose a method for training a decoder using a deep neural network. To represent a highly nonlinear mapping, it is necessary to use a deep network; however, the deeper the network, the more difficult it becomes to stably learn a disconnected feasible domain, and it is easy to obtain a model that can only represent a few disconnected domains. This phenomenon is similar to the well-known mode collapse observed in deep generative models. To solve this problem, we propose to divide the decoder network into sub-models, define a loss function for each sub-model, and train each sub-model sequentially. Numerical experiments show that the proposed method can stably learn many disconnected feasible domains without using training data.

III. PROPOSED METHOD
The primary research question in this study is that of how to train a decoder G without preparing the training data in advance, that is, without human effort. In GANs, we train a generator such that its output is indistinguishable from the training data by the discriminator. In the current context, the training data correspond to a feasible solution set in the feasible domain X; however, we do not have such training data. Instead, we use the information that the constraint functions g j that define X are explicit.

A. DESIGN PRINCIPLE
We consider that G should have the following properties: (a) G is a surjection from latent space Z ⊆ R n to feasible domain X ⊆ R n ; (b) in each feasible subdomain, X , VOLUME 10, 2022 G −1 : X → Z is continuous; and (c) in the latent space, two neighboring points z 1 , z 2 ∈ Z are projected onto the same feasible subdomain X or onto neighboring feasible subdomains X and X . These properties are desirable for successful optimization by transforming the original optimization problem defined in Equation (1) into the problem in Equation (2). Requirement (a) is necessary because if x * = argmin x∈S f (x) / ∈ G(Z), then solving the transformed problem will not yield the optimal solution x * . The requirements (b) and (c) represent a type of continuity; if a small change in the latent space leads to a large change in its projection, a sudden change in the objective function value occurs. Consequently, even if the original objective function f is unimodal and has a good scale, the transformed objective function f • G can either be multimodal or have a bad scale. The above requirements are intended to prevent these problems from occurring.
We represent a decoder G that satisfies the above requirements by using neural networks (NN) G θ : R n → R n . In the following, we consider the latent space Z of the decoder to be equal to the search domain S. In this case, we can consider learning G θ by minimizing the loss function, as follows: where P z represents a probability distribution on Z ⊆ R n and φ : R n → R 0 is an aggregation of constraint violations. For example, it can be defined as where Z : R n → Z is the projection onto Z. As we assume that Z = S is a hyper-rectangle and it is easy to force the decoder to output solutions in Z, e.g., by using Tanh output activation, we can often ignore the projection. By training the NN according to Equation (3), we expect z = G θ (z) if z ∈ X, and G θ (z) to be a projection from z to the nearest feasible domain if z / ∈ X. When training the decoder G θ , we generate M samples, z 1 , . . . , z M (called minibatch), from P z independently every iteration and apply minibatch training. In other words, the expected value E z∼P z is minimized by replacing it with the sample mean 1 M M i=1 for the newly generated mini-batches in each iteration. Note that this loss function does not require the preparation of training data, such as feasible solution sets, in advance.
By training G θ using Equation (3) as the loss function, we expect to obtain G θ , such that G θ (z) ∈ X for any z ∈ Z. However, as shown in Section IV, G θ (Z) trained using the loss function above tends to be biased toward some regions of X. In standard generative models, such as GANs, the phenomenon of output bias toward some training data is notorious as a mode collapse. In this study, we did not use training data; therefore, the phenomenon described above is not necessarily the same as general mode decay; however, it is similar. Here, we refer to mode collapse as the state in which the NN model learns only a portion of the disconnected feasible domain. In the next section, we present the proposed structure and loss function of an NN that avoids mode collapse.

B. PROPOSED STRUCTURE AND TRAINING SCHEME FOR DEEP DECODER NETWORK
One reason for mode collapse is that the decoder network has insufficient representational capability. To learn mapping to a disconnected region, such as the one targeted in this study, it is necessary to use a network with sufficient expressive power, that is, a sufficiently deep network. However, the deeper the network, the more unstable its learning is.
The first idea is to introduce a skip connection into the decoder network. When the loss function is defined as in Equation (3), the first term requires that G θ is close to an isomorphic map. However, it is challenging to realize isomorphic mapping in deep neural networks. To reduce the difficulty, we incorporated the skip connection proposed by ResNet [31] into the model structure.
A skip connection is a structure in which an input to the model is added to the output. That is, we have G θ (z) = z + G res θ (z), where the model to be trained is the residual block G res θ : Z → R n . By adopting a skip connection, only the difference between z and the feasible domain X must be represented by G res θ . This is expected to reduce the learning difficulty compared with simply learning the mapping from Z to X.
Consequent upon preliminary experiments, the number of feasible domains that decoder G θ can learn was improved by introducing a skip connection. However, in tasks with numerous disconnected feasible domains, we have not been able to prevent mode collapse, in which G θ (Z) is biased toward some regions. This may be due to the difficulty of training G θ such that all z / ∈ X are projected to the nearest feasible domain X . If both the mini-batch size and the capacity of G θ are large, it would be feasible. However, in reality, the mini-batch size is finite, and when θ is updated to project a finite number of latent vectors to X = X , the latent vectors that are not in the mini-batch are mapped to X . Once such a mapping is learned, it is difficult to correct it because when θ is updated such that an input z is mapped to X , the constraint term is dominant in Equation (3). If z is incorrectly projected to a distant X , and we try to move it to a neighboring X by the effect of homogeneous mapping, a single parameter update will only take z out of the constraint momentarily, and it will be quickly pushed back to the original X when the constraint term is dominant.
The second innovation represents the decoder G θ as a composite G θ = G θ K • · · · • G θ 1 of several sub-models G θ 1 , . . . , G θ K , and trains each sub-model G θ k independently. For input z, we define x 0 = z and x k = G θ k (x k−1 ). In this case, G θ : z → x K and x K are the final output. The idea is to gradually move the input data closer to X each time it passes through each sub-model G θ k ; thus, the difference in the mapping of G θ k is small and mode collapse is prevented to the decoder obtained by SeqTrain (the proposed approach).
(see Figure 2). Each sub-model G θ k is represented using a skip connection, , as shown in the first device, and is trained according to the loss function where γ k = γ 0 α k−1 are the penalty coefficients. The reason for increasing the coefficients γ k exponentially is to align the differences between the mappings of each sub-model G θ k as much as possible. Because the output x k approaches X as k increases, the constraint term decreases as k increases. In other words, to equalize the weight of the constraint term with the first term for each loss function L k , it is appropriate to increase the coefficient γ k as k increases. This idea is expected to alleviate the aforementioned problems.
The third innovation is to train the coupled model sequentially. The simplest way to train the coupled model G θ = G θ K • · · · • G θ 1 described above is to update the training parameters of all the sub-models G θ k simultaneously at every iteration. By contrast, in the proposed method, we train the connected sub-models G θ k starting from k = 1. Initially, only the model G θ 1 is trained using Equation (5). After a certain number of training iterations, the parameter θ 1 is fixed. Next, the model G θ 2 is trained. In this case, the input to the summodel G θ 2 is G θ 1 (z) for each input z. This process is repeated until k = K . In other words, there is always only a single sub-model G θ k being trained, and its input is the output of model G θ k−1 • · · · • G θ 1 with fixed parameters. The advantage of sequential training is that the number of models can be increased indefinitely by repeating the aforementioned operations; therefore, there is no need to determine the number of models K in advance. In addition, by starting with a small value of the coefficient γ k of the constraint term and increasing the number of models while observing the Algorithm 1 Sequential Decoder Training situation, we can reduce the risk of setting γ k too large and causing mode collapse.
The proposed training scheme for the decoder network is summarized in Algorithm 1.

IV. QUANTITATIVE EVALUATION OF TRAINED DECODER ON TEST PROBLEM
This experiment was performed to quantitatively evaluate the effectiveness of the proposed decoder training method in learning disconnected feasible domains. We focused on determining the number of regions that could be learned when the feasible domain was the union of several disconnected regions; that is, X = L =1 X and X ∩ X = ∅ for any = .
First, we designed a test problem to quantitatively evaluate the number of learned feasible domains X in model G θ . Using this test problem, we focused on (i) the coverage of disconnected regions (see below) and (ii) the search performance using a trained decoder. Using these two metrics, we quantitatively measured the quality of the decoder training method.

A. TEST PROBLEM
To evaluate indices (i) and (ii), we define the following test problem on S = [−1, 1] n : where x * ∈ R n denotes a uniform randomly generated vector from [−1 + ε, 1 − ε] n . The feasible domain is the union of L disconnected closed hyperspheres,B(x * , ε). In other words, X = L =1 X , where X =B(x * , ε). The number of disconnected feasible subdomains and their volumes are easily controllable by changing L and . Moreover, because the location of each feasible subdomain is clear, it is suitable for analyzing the quality of trained decoders.
The optimal solution of the objective function x * f ∈ [−1, 1] n is randomly sampled such that x * f / ∈ X, and the optimal solution to the test problem is at the boundary of the nearest feasible domain from x * f , that is, x * = argmin y∈X x * f −y . The number of dimensions was n = 10, the number of feasible domains was L = 100, and the radius was ε = 0.25.

B. EXPERIMENTAL SETTING
The structure of each sub-model G θ k (k = 1, . . . , K ) is described in Table 1. Mish [44] is similar to ReLU, which is widely used as an activation function; however, Mish is non-flat and smooth everywhere, and the loss function is also smooth, making it easier to optimize than ReLU. The number of sub-models in the coupled model was set to K = 10, the initial value of the coefficients of the constraint term was γ 0 = 0.1, and the increasing coefficient was α = 1.3. The ADAM optimizer [33] was used for training, with a mini-batch size of 128, a learning rate of 0.001, and β 1 = 0.3. A uniform distribution U (S) was used as the generating distribution P z of z. The maximum number of training iterations was set to 20 × 10 4 , and the number of training iterations for each sub-model during sequential training was set to 20 × 10 4 /K = 2 × 10 4 .
The data used by the chosen existing decoder methods, a decoder with Homomorphous Mapping (HM) [35] and decoder with SVDD [12], that is, the set of feasible solutions, were collected by minimizing the constraint violations as the objective function by CMA-ES and saving the feasible solutions obtained in the search process. When g was minimized and the search distribution enters the feasible domain, most of the candidate solutions become feasible solutions, their objective function values all become zero, and the search distribution starts a random search within the feasible domain. The search distribution was restarted every time 100 feasible solutions were collected, and feasible solutions spanning as many regions as possible were collected.
Here, feasibility denotes the probability that G(Z) will enter X and coverage denotes the fraction of the disconnected partially executable region X that G has learned. In our experiments, we set P cov = 0.01, and the probability Pr was estimated using the Monte Carlo method separately from the mini-batch, with 2 15 samples z ∼ P z . By examining these two metrics, we can evaluate how well model G is learning X.
Optimization of the test problem using CMA-ES with the trained decoder model was conducted for 20 trials. For each trial, the mean vector, step size, and covariance matrix were initialized to m (0) ∼ U (S), σ (0) = 2/5 = 0.4, and C (0) = I n , respectively. The true optimal solution of the objective function was set to x * f ∼ U (S) for each trial, and the optimal solution of the problem x * was initialized accordingly. Each trial was terminated when the maximum number of evaluations of the objective function reached 15,000 or when G(m) − x * 2 10 −2 was satisfied. If the latter termination condition was satisfied, it was considered a successful trial as this condition is satisfied only when the feasible subdomain where x * exists is located.
Because infeasible solutions may be generated even with a trained decoder G, we combined CMA-ES and the constraint-handling method ε-ordering [50] to handle such solutions and optimize them. ε-ordering is based on the constraint violation φ(x) = m j=1 max 0, g j (x) , and ranks solutions using the ε-level comparison defined by where ε(t) is defined as Here, φ(x θ ) (θ = 0.2N ) is the value of 20% of the constraint violation amount of infeasible solutions out of the N = 100 solutions generated from the initial search distribution of the CMA-ES, 1 T c is the maximum number of iterations (T c = 0.8 T max ) and cp = 5. The algorithm prioritizes the objective function while allowing constraint violations in the early stages of the optimization, but gradually increases the weight of the constraint function and emphasizes it in the final stages to eventually converge to the feasible domain. The top three methods in the CEC2020 Competitions on Real-World Single-Objective Constrained Optimization that dealt with single-objective constrained problems based on real problems [22], [37], [38] employ ε-ordering and its improved versions. In addition, they are particularly effective for problems where obtaining a feasible solution is difficult. As a baseline, we also show the results of optimization using 1 When using a decoder, many of the generated solutions are feasible. To calculate the constraint violation amount, it is desirable to generate a sufficient number of solutions. Even if numerous solutions are generated, the computational cost is assumed to be sufficiently low because only the constraint functions are evaluated.

C. RESULTS AND DISCUSSION
We first check the effectiveness of the proposed training method, and then we examine the effect of transforming the search domain using the decoder on the optimization results.

1) EFFECT OF COMPONENT-WISE LOSS FUNCTION
In the proposed method, we confirmed that training each sub-model using the loss function defined in Equation (5) leads to the suppression of mode collapse; consequently, a large number of feasible domains X can be trained. For this purpose, we compared the following three approaches: (1) SeqTrain (sequential training) is a method in which sub-models are trained sequentially; (2) SimulTrain (simultaneous training) trains all sub-models simultaneously, but with the same loss function and architecture as SeqTrain; and (3) SingleFix uses the same architecture as SeqTrain and SimulTrain, but trains the entire model using the single loss function defined in Equation (3). By comparing SingleFix with the other two, we can determine the effect of training each sub-model with multiple loss functions.
The feasibility and coverage of the decoder G obtained after training are summarized in Table 2. The SingleFix result shows that feasibility was improved by increasing the coefficient of the penalty term γ , but the coverage was low in both cases, indicating that only a small portion of the feasible domain can be trained. By contrast, the proposed methods, SeqTrain and SimulTrain, which use multiple loss functions, show a significant improvement in coverage compared to SingleFix in most settings. The fact that there is a large difference in coverage between SingleFix and the proposed method, even though the final model size is exactly the same, indicates that the use of multiple loss functions facilitates the learning of complex distributions.

2) EFFECT OF SEQUENTIAL TRAINING
Next, we confirmed the effectiveness of the proposed sequential training. The coverage of SimulTrain is remarkably low when α = 1.6, which is the rate of increase in the penalty coefficient γ k . This is attributable to the fact that the coefficient of the loss function γ k is too large for the models after the middle of the connected models. The coefficients γ k are designed based on the assumption that the value of the penalty term decreases as the method progresses to later models. However, there is no guarantee that the penalty term of later models will be smaller because the mapping ''moving the input slightly in the direction of X'' is not learned in the initial stages of SimulTrain training. In other words, the penalty term of the latter model is likely to be dominant, and mode collapse will occur at this time, as described in Section III. After this occurs, because it cannot be repaired, the coverage converges to 0.02 until the end without recovering.
However, because SeqTrain trains each model sequentially, the possibility of such a problem occurring is low. The results  of SeqTrain confirm that SimulTrain can train approximately 80% of the regions when the coverage is set to 0.02. However, there is a decrease in the coverage compared to α = 1.3; therefore, although it is relatively robust to the increasing rate α, there is a possibility of mode collapse if it is too large.

3) EFFECT ON OPTIMIZATION RESULTS
The feasibility and coverage of the models trained using each method and the success rate after 20 optimizations of the problem using the trained models are shown in Table 3.
The success rate of BruteForce was 0.18, indicating that optimizing the test problem directly was difficult. 2 On the TABLE 2. Results of a typical single trial obtained during three trials in each setting. A similar trend was observed in all trials. α and γ 0 in SeqTrain and SimulTrain, respectively, are parameters that determine the penalty coefficient γ k = γ 0 α k−1 in the loss function L k (Equation (5)) of each sub-model G θ k . The parameter γ in SingleFix represents the penalty coefficient in the loss function (Equation (3)).
other hand, the proposed method achieved a success rate as high as 0.7, indicating that the search domain was transformed using decoder G, which made the optimization process easier. The comparison of the success rates between the methods confirms that the proposed method achieves the highest value. This means that the search domain transformed using the proposed decoder G can be transformed into a space that is easiest (among the compared methods) for the optimization method to search. A visualization of the acquired G is shown in Figure 3, which shows that the proposed method acquires a Voronoi-like map, where the center of each disconnected feasible domain is the representative point. This indicates that the design principle of the proposed method, which states that the nearest z should be projected onto the nearest x, has been realized.
Although the feasibility and coverage of the model obtained by the decoder with SVDD [12] are comparable to those of the proposed method, the success rate is zero. This is because the output G(Z) of the decoder G is degenerate. When the decoder has numerous regions to learn, the output of each region tends to converge to a point in exchange for learning the numerous regions. Consequently, even if it learns the region where the optimal solution is located, it cannot reach the area around the optimal solution.
The results of the decoder with HM [35] show that while the feasibility is 1, the coverage is 0.01, meaning that only one disconnected region of the feasible domain has been learned. In this method, G(z) is calculated via a line search between the feasible solution and latent variable z, which is called the reference point; thus, a feasible solution is always obtained. However, when the feasible domain is disconnected and the measure of the feasible domain is very small compared to the search domain, as in this experiment, it is extremely difficult to find a feasible domain other than the region with the reference point using a line search. This phenomenon is illustrated in Figure 4. Figure 4 shows a heat map of f (G(Z)) for different reference points in a problem with dimensionality n = 2, domain L = 8, and optimal solution of the objective function f (x) = x − x *  TABLE 3. Feasibility and coverage of the decoder G determined by each method and the success rate of optimizing the problem using the decoder G. BruteForce is the result of direct optimization without using the decoder G. For SeqTrain, the model shown in Table 2 was used among the models obtained from three decoder training trials. The success rates were 0.8 and 0.7 when the other two models were used.
onto the region where the reference point is located. In other words, in the 10-dimensional experiment, the coverage was 0.01 because only feasible solutions were obtained in the region where the reference point was located, and the success rate was zero because the reference point did not coincide with the region where the optimal solution was located.

V. APPLICATION TO TOPOLOGY OPTIMIZATION PROBLEM
The experiment outlined in this section was conducted to confirm the effectiveness of the proposed method in transforming optimization problems by using an example TO that is closer to a real-world problem. As mentioned in Section I, TO with volume fraction and perimeter constraints is expected to have disconnected feasible subdomains.

A. MBB BEAM
In the experiment outlined in this section, we used the Python implementation of a classic problem in topology optimization: the symmetric MBB beam [1], [49]. Figure 5 shows the design domain, anchorage of the structure, and location of the external force. The objective of this problem is to determine a structure in a finite design domain that minimizes the distortion of the beam when an external force is applied.
Following the modified SIMP method [7], [55], which is commonly used in TO, the MBB beam was formulated as follows:  where K, U, and F denote the overall stiffness matrix, displacement vector, and external force vector, respectively. The subscript e refers to each element, where k e = k(ρ e ) = E(ρ e )k 0 e is the element's stiffness matrix and k 0 e is the element's stiffness matrix for the unit Young's modulus. ρ is the design variable, and each element ρ e ∈ [0, 1] represents the density. V (ρ) is the volume of the structure, and V * is the upper bound of the volume. The first constraint implies that the structure is not broken by external forces, the second is a volume constraint, and the third is an upper-and lowerbound constraint. The Young's modulus for each element of the design variables is defined by where p is the penalty factor, E min = 10 −9 is a value that prevents the calculation from failing if the element is empty, and E 0 = 1 is the Young's modulus of the solid. In this experiment, the size of the design domain was set to 32 × 32, and the design variable was set to x ∈ S = [−1, 1] n during optimization using CMA-ES, which was converted to [0, 1] 32×32 during the evaluation of the objective function.
In addition to the preceding formulation, we imposed a perimeter constraint defined by where Q : {−1, 1} n → R + is the perimeter length, Q th is the upper limit, and Sign : [−1, 1] n → {−1, 1} n is a function that converts each element of x to −1 if it is less than or equal to zero, and 1 otherwise. The perimeter length constraint imposes a limit on the perimeter length of a structure, which reduces its complexity to the extent that it can be manufactured [4], [24]. We set Q th = 256, and the upper limit of the volume constraint was set such that the ratio of the volume of the structure to that of the entire design domain (= 1024) was 0.2.

B. EXPERIMENTAL SETTING
In this experiment, we used convolutional neural networks (CNNs) as the decoder G. The specifications of the network architecture are presented in Table 4. The number of models was set to K = 10, the initial value of the coefficients of the constraint term was γ 0 = 0.1, and the increasing TABLE 4. Network architecture specifications. The decoder was the composite of K sub-models, G θ = G θ K • · · · • G θ 1 , and each sub-model coefficient was α = 1.3. ADAM [33] was used for training, with a mini-batch size of 64, a learning rate of 0.0001, and β 1 = 0.3. A uniform distribution U (S) was used as the generating distribution P z of z. The number of updates for each sub-model G θ k in the training was set to 100000 iterations. The training data collection method for SVDD and HM was the same as that for Section IV. Because the perimeter constraint is calculated by converting the design variables to binary values, the gradient becomes zero everywhere; this means that the training of the model will not proceed. To avoid this, we used a proxy model V : R n → R + , which approximates the perimeter function Q. The surrogate model V was trained to minimize the mean squared error minimize : E z∼P z (V(G(z)) − Q(G(z))) 2 .
The proxy model V was updated once before updating G θ k in each iteration.
CMA-ES optimization using the trained model was performed for 20 trials for SeqTrain and three trials for SVDD and HM. For each trial, the mean vector, step size, and covariance matrix were initialized as m (0) ∼ U (S), σ (0) = 2/5 = 0.4, and C (0) = I n , respectively. The end condition of each trial was set to be when the maximum number of evaluations of the objective function reached 200000. For other CMA-ES settings, we used the values recommended in [26]. To handle infeasible solutions, CMA-ES and ε-ordering were combined for optimization, as in the experiment in Section IV. As a baseline, we show the results of 20 trials of optimization using CMA-ES with ε-ordering without a decoder (BruteForce).
One of the weaknesses of decoder-based optimization is that it is impossible to obtain a solution that is not included in the image of the trained decoder. In this experiment, after optimizing the decoder obtained by SeqTrain, we also performed hybrid optimization using BruteForce on the obtained results. First, a solution search was performed on the decoder for half the maximum number of evaluations. The resulting solution was then locally searched by BruteForce, with this solution taken to be the initial mean vector m (0) with a small initial step size σ (0) = 0.01. This hybrid approach searches for a feasible solution with a superior objective function value on a global scale using a decoder, and it refines the solution locally in the latter half of the process without being bound by the expressive power of the decoder. Table 5 summarizes the objective function values f (m last ) in the mean vector obtained after MBB-beam optimization. The SIMP method is a gradient-based method widely used in topology optimization. We implemented the SIMP solver published at the same URL [1] as the MBB-beam publication.

C. RESULTS AND DISCUSSION
The results in Table 5 confirm that the minimization of the objective function does not proceed adequately in the optimization using SVDD or HM. The behavior of the volfrac and perimeter constraints for each optimization process illustrated in Figure 6 exhibits almost no change. Because the   same behavior is observed in all three trials, it is highly likely that the output of the decoder G is limited to a very small area, which is why the optimization does not proceed. As discussed in Section IV-C, when the number of disconnected feasible domains in SVDD or HM is large, it is difficult to learn an appropriate decoder, which makes them unsuitable for this type of problem.
The results of BruteForce and SeqTrain in Table 5 confirm that both methods obtain the same or better objective function value compared with that of the SIMP topology optimization method. Note, however, that BruteForce converged outside of the constraint in 4 out of 20 trials; hence, only the average value of 16 trials is shown. The behavior of the optimization process for the three trials of BruteForce in Figure 6 indicates that one trial continued to violate the volumetric (volfrac) constraint. This indicates that the method is converging to outside the constraint. It can be seen that in the failed trials, the clumps are relatively large. To satisfy the volume constraint in this state, it is necessary to decrease each element. However, when a hole is created in the clump, the perimeter length increases rapidly, and such a solution is difficult to select in the optimization process. In other words, it is difficult to escape this situation when the search distribution is small. On the other hand, in a search using a decoder G as in the proposed method, the search domain is generally a feasible domain, so the possibility of convergence to an unconstrained local solution is extremely low. We can also confirm the advantage of the decoder-based approach over BruteForce in terms of the number of f -calls mentioned in Section II. BruteForce did not locate a feasible solution before 10 5 f -calls, whereas the proposed approach already located a feasible solution with a good f -value before 10 5 f -calls.
The advantage of the proposed decoder-based optimization method is best observed in the hybrid method. Although Seq-Train can stably obtain feasible solutions, it cannot search for solutions that cannot be represented by the decoder; therefore, it is difficult to reduce the objective function value even if the search is continued after obtaining a good global structure. On the other hand, in BruteForce, if a feasible domain with a good global structure can be found, a small objective function value can be obtained; however, the rate of finding such a feasible domain is small, and the optimization results show a large variation from trial to trial. When a hybrid method is used, a stable and excellent global structure is found, and the objective function value can be reduced by subsequent refinement.

VI. CONCLUSION
In this study, we proposed a decoder-based constrainthandling method that transforms an explicitly constrained BBO problem (Equation (1)), where the feasible domain X is a union of several disconnected regions, into an optimization problem (Equation (2)), where the search domain is feasible almost everywhere. We designed a method to represent the mapping (decoder) from the latent space Z, which is a convex set, to the disconnected feasible domain X using a deep neural network, and to learn it without using training data. In the proposed method, the decoder is represented as a composite of multiple sub-models, each of which has skip connections. The loss function of each sub-model is defined independently, and the sub-models are trained sequentially. Numerical experiments using a test problem and a topology optimization problem show that the proposed method can find feasible domains with better objective function values and higher probability than both conventional decoderbased constraint-handling methods and non-decoder-based constraint-handling methods.
We recommend the hybrid method, as shown in Section V, as a way to use the decoder proposed in this study. First, the decoder was trained using the proposed method. Because the training of the decoder does not require the evaluation of the objective function value, which is computationally expensive, we can obtain a decoder G * with sufficient performance by adjusting the training parameters using the ratio of feasible solutions as an indicator. Next, the trained decoder G * was used to solve the minimization problem defined by Equation (2) to obtain the solution z * . This allowed us to globally search for a feasible domain with a good objective function value. However, because the decoder does not necessarily represent the entire feasible domain, there is room for improvement. Finally, we locally solved the minimization problem defined in Equation (1), with G * (z * ) as the initial solution, to refine the solution within the feasible domain where G * (z * ) exists. The topology optimization results (Section V) show that the hybrid method can stably obtain a feasible solution with a lower objective function value than optimizing Equation (1) or Equation (2) alone.
There are three elements in the proposed method that must be determined by the user. The first is the initial value, γ 0 , of the loss function's penalty term's coefficient γ k and its rate of increase, α. As shown in Section IV-C, if γ k is larger than necessary, mode collapse may occur, and G may learn only a part of the region. To prevent this, it is recommended that the initial value γ 0 and the rate of increase α be minimized. However, as such values depend on the scale of the constraint function, investigating the guidelines for determining these hyperparameters will contribute to performance improvement. The second is the architecture of G. To obtain a better G, it is desirable to select an architecture that considers the characteristics of the task. In the experiments in this study, we chose a network with fully connected layers for a toy problem and a network with convolutional layers for a topology optimization problem. For the latter, we used prior knowledge that the design variable in our topology optimization problem is naturally treated as a 2D grayscale image and that convolutional layers are known to excel at handling image formats in the machine learning field. Unless such prior knowledge is available, fully connected layers with number of nodes a few times greater than the input dimension n may be the default choice. Choosing a reasonable network architecture may sometimes require expert knowledge in neural networks. The third is the termination condition of the training. In the experiments conducted in this study, the end condition of each connected model was unified at 200k iterations. However, for different tasks and models, this may result in inadequate training. Currently, only the convergence curve of the loss function and the feasibility ratio of the model output guide the termination condition, and these need to be determined by the user through trial-and-error. However, this adjustment requires practical experience in deep learning, which is a weakness of the proposed method. If there is an index to evaluate the goodness of the model, such as the spread and continuity of the model output, the adjustment of the proposed method will be easier and the quality of the acquired model will be improved. These are important issues to be addressed in the future. KAZUTO