A Survey on Evolutionary Construction of Deep Neural Networks

Abstract-Automated construction of deep neural networks (DNNs) has become a research hot spot because a DNN's performance is heavily influenced by its architecture and parameters, which are highly task-dependent, yet it is notoriously difficult to find the most appropriate DNN in terms of architecture and parameters to best solve a given task. In this work, we provide an insight into the automated DNN construction process by formulating it as a multilevel multiobjective large-scale optimization problem with constraints, where the nonconvex, nondifferentiable, and black-box nature of this problem makes evolutionary algorithms (EAs) stand out as a promising solver. Then, we give a systematic review of existing evolutionary DNN construction techniques from different aspects of this optimization problem and analyze the pros and cons of using EA-based methods in each aspect. This work aims to help DNN researchers better understand why, where, and how to utilize EAs for automated DNN construction and, meanwhile, to help EA researchers better understand the task of automated DNN construction so that they may focus more on EA-favored optimization scenarios to devise more effective techniques.


Xun Zhou, A. K. Qin, Senior Member, IEEE, Maoguo Gong, Senior Member, IEEE, and Kay Chen Tan, Fellow, IEEE

Index Terms-Automated design of DNNs, deep neural networks, evolutionary algorithms, optimization.

I. INTRODUCTION
Deep neural networks (DNNs) are among the most powerful machine learning techniques today, driving the revival and boom of artificial intelligence in recent years [1]-[3]. They are characterized by sophisticated task-oriented models (in the form of networks), which allow the most effective feature representation to be learned in a task-driven manner from data of various types, such as images, texts, and time series [4]. The concept of DNNs appeared in the 1970s. After that, the development of DNNs continued but did not attract much attention. The boom of DNNs began in 2012, when a specific DNN model named AlexNet, armed with the computing power of graphics processing units (GPUs), achieved record-breaking classification performance on the ImageNet dataset [5]. Since then, DNNs have received ever-increasing attention, leading to the emergence of a new era in machine learning, namely, deep learning.
DNNs have different types of architectures suited to different types of data. For example, convolutional neural networks (CNNs) are adept at learning features from data with certain local structures, such as images [4]. Recurrent neural networks (RNNs) are good at learning temporal behavior from sequence data, such as time series [4]. For the same type of DNN, there also exist various kinds of models. For example, AlexNet [5], VGG [6], Inception [7], and ResNet [8] are popular CNN models, while LSTM and GRU are commonly used RNN models [9]. It is well known that a DNN's performance depends on both the model architecture and the model parameters. In practice, however, it is very challenging to find the most suitable DNN model in terms of architecture and parameters to best solve a given task, because doing so amounts to solving a highly complex large-scale optimization problem of nonconvex and black-box nature [10].
The traditional way to address this issue assumes that the model architecture is manually specified and leaves the task of learning model parameters to a selected and configured model learner. This often leads to suboptimal performance because of a lack of sufficient expert knowledge and human labor to make the best choice from a vast number of possible model architectures, model learners, and their associated parameters. Recent years have seen rapidly growing efforts in both academia and industry to study automated DNN construction techniques that aim to automatically determine the best-performing model architecture and parameters for a given task [10]-[15]. Such techniques commonly perform search-based optimization over the model architecture, model parameters, and model learners, where the optimization of model parameters is carried out by a model parameter learner, such as stochastic gradient descent (SGD) or evolutionary algorithms (EAs), and is nested within the optimization of the model architecture, which is carried out by a model architecture learner, such as reinforcement learning (RL) [11], [12], EAs [13]-[15], or SGD [10].
There exist quite a few survey papers on relevant topics, each focusing on a specific type of optimization problem, such as model architecture optimization (commonly known as "neural architecture search") [135]-[137], a specific type of DNN model, such as CNNs [138], or a specific type of optimization technique, such as RL [139] and EAs [140], [141]. Different from the existing survey papers, this work provides a comprehensive and systematic review of using EAs to address the various optimization problems involved in the automated DNN construction process, based on an insight into this process from the perspective of optimization. It also discusses the pros and cons of EA-based techniques compared to non-EA-based ones in different optimization scenarios. We aim to help DNN researchers better understand why, where, and how to use EAs for automated DNN construction and also to help EA researchers better understand the task of automated DNN construction so that they may focus more on EA-favored optimization scenarios to devise more effective techniques.
Our major contributions include the following.
1) Define and analyze the various optimization problems involved in the automated DNN construction process, revealing the motivations for using EAs as the solver.
2) Provide a taxonomy and survey of existing EA-based techniques in different optimization scenarios and discuss the pros and cons of EA-based techniques compared to non-EA-based ones in every scenario.
3) Summarize applications, challenges, and trends regarding evolutionary DNN construction, where the publicly available datasets and codes used in studies are reported in a supplementary document, which, to the best of our knowledge, has not been provided in existing survey works.
The remainder of this article is organized as follows. Section II provides an insight into automated DNN construction from the perspective of optimization. Section III introduces several fundamentals of evolutionary DNN construction. Sections IV-VII review evolutionary DNN construction techniques in different optimization scenarios, including model parameter optimization, model architecture optimization, model learner optimization, and miscellaneous topics. Section VIII summarizes applications, challenges, and trends in evolutionary DNN construction. Finally, Section IX concludes the article.

II. INSIGHTS INTO AUTOMATED DNN CONSTRUCTION
Automated DNN construction aims at finding the most appropriate DNN model architecture and parameters to best solve a given task, which typically requires dealing with the following optimization problems.
Model architecture optimization, widely known as neural architecture search (NAS), pursues the most appropriate architecture to solve a given task by optimizing, with respect to a set of decision variables determined by the representation of the model architecture, one or more objectives that evaluate model performance from different aspects, such as accuracy and time efficiency. This optimization problem typically has a very large discrete search space due to the wide variety of model architectures. Also, it has a black-box nature because it usually lacks explicit mathematical functions that directly formulate the mapping from architecture to performance. Furthermore, the performance of a specific model architecture depends on the associated model parameters. As a result, this optimization problem is inherently bilevel, where architecture optimization is the upper-level task and parameter optimization is the lower-level task (which is nested within the upper-level task).
Model parameter optimization seeks the best parameters (i.e., connection weights and biases) for a model with a prespecified architecture to best solve a given task by optimizing, with respect to a set of decision variables determined by the representation of the model parameters, an objective that evaluates the effectiveness of the model parameters based on the model's performance on training data, the so-called training loss. This optimization problem typically has a prohibitively large continuous search space due to the extremely large number of model parameters. Also, it is highly nonconvex, featuring a very challenging search landscape full of local optima.
Model (architecture and parameter) learner optimization aims to find the most effective learners (intrinsically optimizers) to best solve the above two optimization problems, respectively. It belongs to the problem of optimizing optimizers, i.e., searching for the best optimizer [together with its best-calibrated parameters, also known as (a.k.a.) model hyperparameters] to maximize its performance on solving an optimization problem. In fact, most existing studies on automated DNN construction do not explicitly consider this problem because it leads to a multilevel optimization problem, which is seldom feasible to solve in practice due to its prohibitively expensive computational cost.
By considering all the above aspects, we formulate automated DNN construction, given a training set $D_{trn}$ and a validation set $D_{val}$ (used to measure the model's generalization performance [10], [17]) for the task to be solved, as the multilevel multiobjective optimization problem defined in (1). $M_a$, $M_p$, $M_{la}$, and $M_{lp}$ denote the sets of decision variables corresponding to the representations of the model architecture, model parameters, model architecture learner, and model parameter learner, respectively. The constraints applied to each of them at the different optimization levels define the feasible search space of the decision variables. This optimization problem has four levels, with levels one to four representing the recursively nested upper to lower levels that correspond to model architecture learner optimization, model architecture optimization, model parameter learner optimization, and model parameter optimization, respectively.
Both model architecture learner optimization and model architecture optimization may have multiple objectives defined on the validation set $D_{val}$, e.g., validation accuracy (an estimate of generalization performance) and time efficiency. Model parameter learner optimization and model parameter optimization have one objective defined on the training set $D_{trn}$, which is typically the training loss.
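The four nested levels described above can be written out schematically. The block below is only a sketch assembled from the verbal description in this section, not a reproduction of (1): the feasible sets $\Omega_{la}$, $\Omega_a$, $\Omega_{lp}$, and $\Omega_p$, the vector notation $(O_1, \ldots, O_m)$, the loss symbol $L$, and the argument order are all assumptions introduced for illustration.

```latex
\begin{aligned}
\text{Level 1:}\quad & \min_{M_{la} \in \Omega_{la}} \; \big(O_1, \ldots, O_m\big)\big(M_a^{*}, M_p^{*};\, D_{val}\big) \\
\text{Level 2:}\quad & M_a^{*} \in \arg\min_{M_a \in \Omega_a} \; \big(O_1, \ldots, O_m\big)\big(M_a, M_p^{*};\, D_{val}\big) \quad \text{(searched by } M_{la}\text{)} \\
\text{Level 3:}\quad & M_{lp}^{*} \in \arg\min_{M_{lp} \in \Omega_{lp}} \; L\big(M_p^{*};\, M_a,\, D_{trn}\big) \\
\text{Level 4:}\quad & M_p^{*} \in \arg\min_{M_p \in \Omega_p} \; L\big(M_p;\, M_a,\, D_{trn}\big) \quad \text{(searched by } M_{lp}\text{)}
\end{aligned}
```

Each level is evaluated only after the levels below it have been solved, which is why the cost of the full problem grows multiplicatively with the number of levels.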
The solution to this optimization problem is a Pareto optimal set, denoted by $P(M_a, M_p, M_{la}, M_{lp})$, which contains multiple nondominated solutions.¹ Eventually, one of these nondominated solutions is chosen.
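To make the notion of a nondominated set concrete, the following minimal sketch filters a list of objective vectors under minimization. The function names and the example numbers (validation error and inference time for four hypothetical models) are illustrative assumptions, not data from any surveyed work.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if objective vector a dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def nondominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]


# Hypothetical (validation error, inference time in ms) pairs for four models.
candidates = [(0.10, 30.0), (0.08, 45.0), (0.12, 50.0), (0.08, 40.0)]
front = nondominated(candidates)
print(front)  # [(0.1, 30.0), (0.08, 40.0)]
```

The two surviving models trade accuracy against speed, so neither can be discarded without a preference from the user; this is the final choice the text refers to.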
Solving this multilevel multiobjective optimization problem with constraints is seldom feasible in practice, even with the most cutting-edge computing facilities, due to its large-scale, nonconvex, nondifferentiable (in some cases of model architecture optimization), and black-box nature. Accordingly, existing research has focused on solving simplified versions of this problem, e.g., optimizing the model parameters $M_p$ given a fixed model architecture $M_a$ and model parameter learner $M_{lp}$, optimizing the model architecture $M_a$ together with the model parameters $M_p$ given fixed model learners $M_{la}$ and $M_{lp}$, and optimizing the model learners $M_{la}$ and $M_{lp}$ inside the model parameter and/or architecture optimization process. Furthermore, traditional mathematical optimization techniques such as gradient-descent-based methods are ill-suited to this problem because of its nonconvex, nondifferentiable, and black-box nature. As a result, EAs have gained increasing attention and popularity as a promising solver because of their capability to solve nonconvex, nondifferentiable, and black-box optimization problems, and because rapid advances in high-performance computing mitigate the notorious computational bottleneck of EAs.
This work provides a taxonomy and survey of the existing evolutionary DNN construction techniques applied to model parameter optimization (Section IV), model architecture optimization (Section V), and model learner optimization (Section VI), respectively, which are simplified versions of the optimization problem defined in (1). Furthermore, Section VII discusses miscellaneous topics relevant to evolutionary DNN construction, e.g., optimization in terms of objectives and speedup.

III. OVERVIEW OF EVOLUTIONARY DNN CONSTRUCTION
Evolutionary DNN construction studies the use of EAs for automated DNN construction. To help better understand such techniques, we provide an overview of DNNs and EAs followed by a short introduction to the fundamentals and motivations of evolutionary DNN construction techniques.

¹ Considering minimization problems and two candidate solutions $x_1$ and $x_2$, $x_1$ dominates $x_2$ if $O_i(x_1) \le O_i(x_2)$ for all $i \in \{1, \ldots, m\}$ and $O_j(x_1) < O_j(x_2)$ for some $j \in \{1, \ldots, m\}$. When neither $x_1$ nor $x_2$ dominates the other, they are nondominated solutions.

A. DNNs
DNNs are characterized by a powerful feature representation learning capability due to their sophisticated hierarchical architectures, which can extract a hierarchy of feature sets at different representational levels to address a specific task, such as classification or regression [4].
Different types of DNN architectures have been designed to deal with different types of data. For the same type of architecture, there exist many different models. The deep belief network (DBN) is good at handling data with independent features [142]. The CNN is designed for data with local structures, such as images and videos [4]. The RNN is suited to sequence data, such as texts and time series [4]. To configure these models for use, the number of layers of each type (e.g., fully connected layers, pooling layers, and convolution layers) and the parameters of each layer (e.g., the number of neurons in a layer, and the kernel size, stride size, and padding value in pooling and convolution layers) need to be determined. Also, for some DNN architectures such as ResNet, the topological connections across layers need to be determined. Some DNN architectures are designed for special tasks. For example, the stacked autoencoder (SAE) [143], with its symmetrical architecture of encoder and decoder, is designed for learning feature representations via encoding and decoding, where the architectures of the encoder and the decoder need to be designed. The generative adversarial network (GAN) [144], [206] is designed to produce new plausible data via co-learning two DNNs, i.e., the generator and the discriminator, which may have different architectures.
Besides the model architecture, a DNN's performance also depends on the model parameters. Typically, model parameters are optimized via gradient-based methods, such as SGD, Momentum, Adagrad, and Adam [145]. However, such methods are prone to getting stuck in local optima due to the high nonconvexity of the loss function they optimize. Furthermore, their performance is sensitive to their own parameters (a.k.a. model hyperparameters), such as the learning rate, batch size, and decay rate.

B. EAs
EAs are population-based metaheuristic search algorithms inspired by nature and biology [16]. This article does not strictly distinguish between EAs and swarm intelligence algorithms because they typically follow a general framework composed of initialization $I$, evaluation $E$, reproduction $R$, and selection $S$ modules. Specifically, to solve an optimization problem, a population of candidate solutions $P = \{x_1, \ldots, x_N\}$ is first initialized. Then, each candidate solution $x_i$ is evaluated to calculate its fitness value $f$. New candidate solutions (a.k.a. children) are generated via reproduction from candidate solutions chosen from the population based on fitness (a.k.a. parents). The reproduced candidate solutions are evaluated to obtain their fitness values. Finally, a new population is formed by selecting elite candidate solutions with higher fitness values from the combination of the reproduced candidate solutions and the old population. This process is repeated until some termination criteria are met. Then, the individual in the population with the best fitness value is selected as the output. A general framework of EAs is described in Algorithm 1.
Different EAs have been proposed under different natural or biological inspirations, e.g., the genetic algorithm (GA) [60], genetic programming (GP) [75], evolution strategy (ES) [146], differential evolution (DE) [92], particle swarm optimization (PSO) [147], ant colony optimization (ACO) [65], and artificial bee colony (ABC) [42], which mainly differ in one or more of the modules $I$, $E$, $R$, and $S$ in the framework. For initialization, $I$ is applied to initialize candidate solutions. There are two commonly used types of initialization methods, i.e., random initialization [52] and knowledge-based initialization [56]. For evaluation, $E$ is applied to calculate the fitness value of a candidate solution [82]. For reproduction, $R$ is applied to generate new candidate solutions from candidate solutions chosen from the current population. For example, in GA, recombination exchanges some parts of two candidate solutions to produce two new ones [52]. In ES, mutation alters some parts of a candidate solution to a prespecified degree [148]. In PSO, a candidate solution (a.k.a. particle) is changed via position and velocity updating formulas [57]. For selection, $S$ is used to select promising candidate solutions to form the new population, typically based on fitness values. For example, in tournament selection [72], two candidate solutions are chosen randomly from the population, and the one with the higher fitness is added to the new population. There are also other selection methods, such as roulette wheel selection [59].
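The reproduction and selection operators named above can be sketched on real-valued vectors as follows. This is a minimal illustration under stated assumptions: one-point crossover stands in for GA recombination, isotropic Gaussian noise for ES mutation, and the PSO coefficients `w`, `c1`, and `c2` are common textbook defaults rather than values taken from the cited works.

```python
import random
from typing import List, Tuple

Vector = List[float]


def one_point_crossover(p1: Vector, p2: Vector,
                        rng: random.Random) -> Tuple[Vector, Vector]:
    """GA-style recombination: swap the tails of two parents at a random cut."""
    cut = rng.randint(1, len(p1) - 1)  # requires at least two genes
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def gaussian_mutation(x: Vector, sigma: float, rng: random.Random) -> Vector:
    """ES-style mutation: perturb every gene with zero-mean Gaussian noise."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]


def binary_tournament(pop: List[Vector], fitness: List[float],
                      rng: random.Random) -> Vector:
    """Draw two distinct individuals and return the fitter one (higher is better)."""
    i, j = rng.sample(range(len(pop)), 2)
    return pop[i] if fitness[i] >= fitness[j] else pop[j]


def pso_step(x: Vector, v: Vector, pbest: Vector, gbest: Vector,
             rng: random.Random, w: float = 0.7, c1: float = 1.5,
             c2: float = 1.5) -> Tuple[Vector, Vector]:
    """Canonical PSO velocity and position update for a single particle."""
    new_v = [w * vi + c1 * rng.random() * (pb - xi) + c2 * rng.random() * (gb - xi)
             for xi, vi, pb, gb in zip(x, v, pbest, gbest)]
    new_x = [xi + nvi for xi, nvi in zip(x, new_v)]
    return new_x, new_v
```

Passing an explicit `random.Random` instance keeps each operator reproducible, which matters when fitness evaluation means training a network and runs are expensive to repeat.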

C. Evolutionary Construction of DNNs
The evolutionary construction of DNNs applies EAs to automatically construct DNNs. Given the insights into the automated DNN construction process, EAs are inherently suitable for solving some of the optimization problems involved in the process. For example, EAs are well known for their capability to solve black-box nonconvex optimization problems [17]. Also, multiobjective EAs (MOEAs) have been intensively studied as an effective way to deal with multiple conflicting objectives [70], [149]. Moreover, EAs are highly parallelizable and thus may benefit from the rapid advance in high-performance computing to accelerate computation [19], [20].
EAs have a long history of being applied to design and configure neural networks, dating back to the 1990s [150]. With the renaissance of neural networks in the form of deep learning, whose starting point is commonly regarded as the invention of AlexNet in 2012, EAs have been applied to automatically construct DNNs in terms of DNN parameters [151], DNN architecture [59], and DNN learners [131]. Particularly, for DNN architecture design (commonly known as NAS), EAs have demonstrated promising performance [13], [69], achieving state-of-the-art accuracy on many benchmark test problems, e.g., CIFAR-10, CIFAR-100, and ImageNet.

Algorithm 1 General Framework of EAs
...
11: $\{P, F\} \leftarrow S(\{P, F\} \cup \{C, F_c\})$
12: end while
13: Select from $P$ the individual with the best fitness
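The general framework of Algorithm 1, as described in Section III-B, can be sketched as a short loop over the $I$, $E$, $R$, and $S$ modules. The concrete choices below are assumptions made for illustration only: uniform random initialization in $[-5, 5]$, binary tournament parent selection with Gaussian mutation as reproduction, elitist truncation as selection, a fixed generation budget as the termination criterion, and a negated sphere function as a toy fitness.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def evolve(fitness: Callable[[Vector], float], dim: int, pop_size: int = 20,
           generations: int = 100, sigma: float = 0.3,
           seed: int = 0) -> Tuple[Vector, float]:
    """Maximize `fitness` with a minimal elitist EA."""
    rng = random.Random(seed)
    # I: random initialization of the population P.
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(pop_size)]
    # E: evaluate every candidate to obtain the fitness values F.
    fit = [fitness(x) for x in pop]
    for _ in range(generations):  # termination criterion: generation budget
        # R: each child is a Gaussian-mutated copy of a tournament winner.
        children = []
        for _ in range(pop_size):
            i, j = rng.sample(range(pop_size), 2)
            parent = pop[i] if fit[i] >= fit[j] else pop[j]
            children.append([g + rng.gauss(0.0, sigma) for g in parent])
        # E: evaluate the children to obtain F_c.
        child_fit = [fitness(c) for c in children]
        # S: keep the pop_size fittest members of the union of parents and children.
        merged = sorted(zip(fit + child_fit, pop + children),
                        key=lambda t: t[0], reverse=True)[:pop_size]
        fit = [f for f, _ in merged]
        pop = [x for _, x in merged]
    # Output: the individual with the best fitness in the final population.
    best = max(range(pop_size), key=lambda k: fit[k])
    return pop[best], fit[best]


def neg_sphere(x: Vector) -> float:
    """Toy fitness: higher is better, maximum 0 at the origin."""
    return -sum(g * g for g in x)


best_x, best_f = evolve(neg_sphere, dim=3)
```

Because selection draws from the union of the old population and the children, the best fitness never decreases from one generation to the next. In DNN construction, `fitness` would wrap a far more expensive evaluation, such as training and validating a network.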
In the following sections, we will review and categorize existing works on the applications of EAs to the different optimization problems involved in automated DNN construction, as analyzed in Section II. Also, we will discuss the pros and cons of EA-based techniques compared to non-EA-based ones when solving different optimization problems.

IV. MODEL PARAMETER OPTIMIZATION

A. Problem Statement
Model parameter optimization, a.k.a. DNN training, searches for the best model parameters $M_p^{*}$ for a DNN with a fixed architecture $M_a^{\dagger}$ to best solve a given task by optimizing, with respect to (w.r.t.) the model parameters $M_p$, an objective function $L$ that measures the performance of the model (defined by $M_a^{\dagger}$ and $M_p$) on the training data $D_{trn}$, using a manually specified and configured model parameter learner $M_{lp}^{\dagger}$. This optimization problem, simplified from that defined in (1), is a large-scale (likely with millions of model parameters) and highly nonconvex continuous optimization problem.
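Written out from the symbols defined in this paragraph, the problem takes roughly the following form. This is a sketch reconstructed from the verbal description only: the argument order of $L$ and the omission of an explicit feasible set for $M_p$ are assumptions.

```latex
M_p^{*} = \arg\min_{M_p} \; L\big(M_p;\, M_a^{\dagger},\, M_{lp}^{\dagger},\, D_{trn}\big)
```

Here the daggers mark quantities that are fixed by hand rather than searched, so only $M_p$ varies.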
In the following, we will provide a systematic review of EA-based methods applied to solve this problem and then discuss the pros and cons of EA-based methods in comparison to non-EA-based ones.

B. Taxonomy and Survey of Existing EA-Based Approaches
1) Taxonomy: The existing EA-based approaches for model parameter optimization mainly differ in the representation of parameters and the search paradigm of the EA. In this part, we categorize these works from two aspects, i.e., solution representations and search paradigms.
2) Solution Representations: For EAs, the solution is often represented via a certain encoding scheme. In the following, we review the commonly used solution representation schemes under two categories, i.e., direct encoding and indirect encoding.
1) Direct Encoding: In this scheme, each model parameter is directly represented as it is. All model parameters are represented as a vector [29], [33], [40], [43], where each element in the vector denotes one specific model parameter.
2) Indirect Encoding: In this scheme, each model parameter is represented in an encoded form obtained via a certain mapping [25], [27], [37], [52]. For example, binary encoding is often used in GA [25], [27], where each model parameter is represented via a bit string. In [37], network weights are represented by a set of coefficients, which are applied to linearly combine some predefined basis vectors to produce the network weights. In [52], the encoding is via the mean and variance of a Gaussian distribution, and the weights of the DNN are sampled from this distribution.

3) Search Paradigms: Search paradigms refer to the frameworks of search algorithms. From this view, EA-based approaches can be divided into two categories, i.e., pure EAs and hybrid EAs, according to whether gradient-based methods are incorporated.
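Returning to the solution representations in 2), the direct and indirect encoding schemes can be illustrated on a toy layer. Everything in this sketch is hypothetical: the 2x2 weight matrix, the quantization range $[-1, 1]$ with 8 bits per parameter, and the helper names are chosen for illustration and are not taken from [25], [27], or [37].

```python
from typing import List

# A toy layer with weights W (2x2) and biases b (2), used purely for illustration.
W = [[0.5, -0.25], [0.75, 0.0]]
b = [0.1, -0.1]


def encode_direct(weights: List[List[float]], biases: List[float]) -> List[float]:
    """Direct encoding: the genome is the flat vector of the parameters themselves."""
    return [w for row in weights for w in row] + list(biases)


def encode_binary(value: float, lo: float = -1.0, hi: float = 1.0,
                  n_bits: int = 8) -> str:
    """Indirect (binary) encoding: quantize a parameter in [lo, hi] to a bit string."""
    level = round((value - lo) / (hi - lo) * (2 ** n_bits - 1))
    return format(level, f"0{n_bits}b")


def decode_binary(bits: str, lo: float = -1.0, hi: float = 1.0) -> float:
    """Map a bit string back to a real-valued parameter in [lo, hi]."""
    return lo + int(bits, 2) / (2 ** len(bits) - 1) * (hi - lo)


def decode_basis(coeffs: List[float], basis: List[List[float]]) -> List[float]:
    """Indirect (basis) encoding: weights are a linear combination of fixed basis vectors."""
    dim = len(basis[0])
    return [sum(c * vec[k] for c, vec in zip(coeffs, basis)) for k in range(dim)]


genome_direct = encode_direct(W, b)
genome_binary = [encode_binary(g) for g in genome_direct]
```

With the basis scheme the EA searches over `coeffs` only, so the genome can be much shorter than the weight vector it produces, at the cost of restricting the weights to the span of the chosen basis.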
Pure EAs rely solely on EA-based approaches to solve the model parameter optimization problem.
1) Basic Evolution: In this framework, model parameters are optimized by following the general EA framework depicted in Algorithm 1 [25]-[30].
2) Cooperative Co-Evolution: In this framework, the model parameters are decomposed into several groups, each of which is optimized by an EA in a separate but cooperative way. Finally, the best model parameters found in each group are combined to produce the solution [21]-[24]. In [22], two approaches were proposed to decompose model parameters, i.e., synapse based and neuron based, where a group corresponds to a single connection weight and to all connection weights of a neuron, respectively. Model parameters in different groups are optimized by EAs separately, and when an individual in a group is to be evaluated, cooperative evaluation is applied: the individual is concatenated with the best individuals from the other groups to assemble a complete set of model parameters. After the model parameters in all the groups have been optimized, the final model parameters are typically defined as the combination of the best individuals from each group. An example of neuron-based cooperative co-evolution is shown in Fig. 1.

Fig. 1. Neuron-based cooperative co-evolution, where the connection weights for neuron 1 and neuron 2 are partitioned into two separate groups ($w_1$ and $w_2$) and ($w_3$, $w_4$, and $w_5$). Each group is evolved by the EA separately, while the evaluation of any individual in a group's population is performed in a cooperative way. For example, when evaluating the individual (0.2 and 0.1) from the group corresponding to neuron 1, it is combined with the best individual found so far, i.e., (0.5, 0.3, and 0.4), from the group corresponding to neuron 2, and the two are evaluated as a whole. The combination of the best individuals from each group forms the final output.

Hybrid EAs combine EAs with gradient-based methods to optimize model parameters. There are three common hybridization frameworks.
1) The first is applying the EA to find the best model parameters, which are then further optimized by using the gradient-based method to generate the final solution [31]- [36].
2) The second is applying the gradient-based method to first produce sets of model parameters, which are then used to initialize the population of the EA to let the EA keep searching for the best model parameters [39], [40].
3) The third is using the EA and the gradient-based method alternately, where the output of one method serves as the starting point for the other, and this procedure is iterated until some stopping criteria are met. Existing works mainly differ in the order in which the two methods are applied in the iterative procedure and in the individuals chosen to be further optimized by the gradient-based method [38], [41]-[43], [47]. For example, in [42], the EA is applied first, and the top 10% of individuals with the highest fitness in the final population of the EA are further optimized by the gradient-based methods. In [38], the model is trained by using the gradient-based method until the performance improvement falls below a certain threshold. Then, the EA is applied to further optimize the model parameters. The best solution produced by the EA is used as the starting point for applying the gradient-based method again. This process is repeated until certain criteria are met.
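The first hybridization framework (EA search followed by gradient-based refinement) can be sketched on a toy problem. The sketch below is not the method of any cited work: the Rastrigin function stands in for a nonconvex training loss, the EA is a bare-bones elitist scheme with Gaussian mutation, and the learning rate is an assumption chosen below the reciprocal of the largest curvature of this particular function so that plain gradient descent cannot increase the loss.

```python
import math
import random
from typing import List

Vector = List[float]


def loss(w: Vector) -> float:
    """Toy nonconvex 'training loss' (Rastrigin), global minimum 0 at w = 0."""
    return sum(x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x)) for x in w)


def grad(w: Vector) -> Vector:
    """Analytic gradient of the toy loss."""
    return [2.0 * x + 20.0 * math.pi * math.sin(2.0 * math.pi * x) for x in w]


def ea_search(dim: int, pop_size: int = 30, generations: int = 60,
              sigma: float = 0.5, seed: int = 0) -> Vector:
    """Stage 1: a simple elitist EA explores the search space globally."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        children = [[x + rng.gauss(0.0, sigma) for x in rng.choice(pop)]
                    for _ in range(pop_size)]
        pop = sorted(pop + children, key=loss)[:pop_size]  # keep the best
    return pop[0]


def gradient_descent(w: Vector, lr: float = 0.002, steps: int = 200) -> Vector:
    """Stage 2: gradient descent exploits the basin located by the EA."""
    for _ in range(steps):
        w = [x - lr * g for x, g in zip(w, grad(w))]
    return w


w_ea = ea_search(dim=2)
w_final = gradient_descent(w_ea)
print(loss(w_ea), loss(w_final))
```

The second and third frameworks reuse the same two building blocks: seeding the initial population with gradient-trained solutions, or calling `ea_search` and `gradient_descent` in alternation with each stage starting from the other's output.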

C. Discussion
Gradient-based methods [145] and EAs [152] are the two commonly used techniques for model parameter optimization. The gradient-based method is computationally efficient but may easily get stuck in inferior local optima due to its local search nature. The EA may mitigate the issue of getting stuck in inferior local optima due to its global search nature but suffers from a high computational cost [153].
Three suitable optimization scenarios for EA-based methods are summarized as follows.
1) EAs are suitable for training DNNs of small size, such as RNNs and DBNs. For example, in [21]-[24], EAs are applied to train the RNN, where direct encoding is used to represent the model parameters and cooperative co-evolution is used as the search paradigm. Because of the global search nature of EAs, the models trained by EAs achieve better performance than the ones trained by the gradient-based methods.
2) EAs can be combined with gradient-based methods to train large-scale DNNs. For a DNN with millions of parameters, the pure EA search paradigm becomes less competent. In this situation, the hybrid EA search paradigm becomes promising, where the EA is combined with the gradient-based method in some way to train the DNN. The improvements contributed by EAs can be seen from two aspects: a) EAs can explore the promising regions of the search space, which helps the gradient-based method exploit the best solution more effectively, and b) the global search nature of EAs can help the gradient-based method jump out of inferior local optima.
3) When exact gradient information of the loss function is hard to obtain, EAs can be used to train the DNN. A typical scenario is deep RL (DRL) tasks, where, due to sparse rewards, exact gradient information often cannot be obtained to update the policy network's model parameters. Therefore, in recent years, EAs have been applied to train the policy network following the pure EA search paradigm. According to the experimental results reported in [27]-[30], [45], [154], and [155], the policy networks trained by EAs have demonstrated promising performance in various test scenarios.

V. MODEL ARCHITECTURE OPTIMIZATION

A. Problem Statement
Model architecture optimization, a.k.a. NAS, is a bilevel optimization problem [10], [156], where the lower-level task of model parameter optimization is nested within the upper-level task of model architecture optimization. It may involve more than one objective function $O_i$, $i = 1, \ldots, m$, each measuring the performance of the model on the validation data $D_{val}$ from a different aspect, such as accuracy or speed, which leads to multiobjective optimization.
Assuming that a manually specified and configured model architecture learner $M_{la}^{\dagger}$ and a manually specified and configured model parameter learner $M_{lp}^{\dagger}$ are used, this optimization problem, simplified from that defined in (1), is a very challenging bilevel optimization problem where the upper-level architecture optimization is multiobjective, nondifferentiable (in many cases), and black-box, while the lower-level parameter optimization is large-scale and nonconvex. The solution to this problem is a Pareto optimal set $P(M_a, M_p)$ composed of nondominated solutions. Eventually, one of these nondominated solutions is chosen and deployed.
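Using the symbols defined above, the bilevel structure can be sketched as follows. This is a reconstruction from the verbal description rather than the original formulation: the argument order and the omission of explicit feasible sets are assumptions, and the upper-level search is understood to be carried out by the fixed learner $M_{la}^{\dagger}$.

```latex
\begin{aligned}
\min_{M_a} \;& \big( O_1(M_a, M_p^{*};\, D_{val}), \ldots, O_m(M_a, M_p^{*};\, D_{val}) \big) \\
\text{s.t.} \;& M_p^{*} = \arg\min_{M_p} \; L\big(M_p;\, M_a,\, M_{lp}^{\dagger},\, D_{trn}\big)
\end{aligned}
```

Every evaluation of a candidate architecture $M_a$ at the upper level therefore requires solving (or approximating) a full training problem at the lower level.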
In the following, we will provide a systematic review of EA-based methods applied to solve this problem and then discuss the pros and cons of EA-based methods in comparison to non-EA-based ones.

B. Taxonomy and Survey of Existing EA-Based Approaches
1) Taxonomy: Similar to model parameter optimization, existing EA-based approaches mainly differ in their solution representations and search paradigms. In addition, customized operators have been designed to search the complex model architecture space more effectively. Therefore, in this section, EA-based approaches for model architecture optimization are categorized from three aspects, i.e., solution representations, search paradigms, and customized search operators.
2) Solution Representations: DNNs have a wide variety of architectures suitable for dealing with tasks of different types and complexity. In general, a DNN's architecture is determined by two factors, i.e., architectural units and the topological patterns that define the connections between different units. Once these two factors are specified, a certain architecture is determined. Therefore, existing solution representations in model architecture optimization are often defined with respect to these two factors, where the architectural unit and the topological pattern are encoded as decision variables (in the representation) in certain ways. Then, some constraints, e.g., search ranges, are set for them to define the search space within which model architecture optimization is carried out. Existing architectural units can be categorized into microunits and macrounits, where the former denotes a basic operational unit in the model, e.g., a convolutional layer, and the latter denotes a complex operational unit composed of multiple basic ones, e.g., a residual block. Existing topological patterns can be categorized into fixed and nonfixed ones. For the fixed pattern, once the architectural units of a model are specified in an ordered way, their connections are directly determined. In contrast, the connections between architectural units need to be specified for the nonfixed pattern. In the following, we categorize the solution representations used in existing works on model architecture optimization according to fixed and nonfixed topological patterns, and discuss architectural units therein. We do not categorize architectural units because there are too many of them, and the choice of units for defining the search space is problem dependent and needs to balance accuracy against computational time.
Fixed Topological Patterns: The solution representations following this pattern are determined solely by architectural units. Fig. 2 illustrates an example solution representation following this pattern, where a simple CNN is represented in vectorial form via the concatenation of the representations of multiple basic architectural units, i.e., convolutional, pooling, and fully connected layers, and each basic unit is encoded via its type and parameters such as the kernel size. Sometimes, the same unit is repeated multiple times, and the number of repetitions can be encoded in the representation. For such solution representations, as long as the order of a unit is determined, i.e., its position in the representation, its connections to other units are determined.
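As a minimal sketch of such a vectorial encoding, the following Python fragment concatenates one (type, size, repeat) triple per unit, so that the position of a triple alone implies which units it connects to. The integer type codes, the three-gene layout, and the names `Unit`, `encode`, and `decode` are illustrative assumptions and are not taken from any surveyed work.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical integer codes for the three unit types named in the text.
UNIT_TYPES = {"conv": 0, "pool": 1, "fc": 2}
CODE_TO_TYPE = {code: name for name, code in UNIT_TYPES.items()}


@dataclass
class Unit:
    kind: str        # "conv", "pool", or "fc"
    size: int        # kernel size (conv/pool) or number of neurons (fc)
    repeat: int = 1  # how many times the unit is repeated in sequence


def encode(units: List[Unit]) -> List[int]:
    """Concatenate (type, size, repeat) genes in unit order."""
    genome: List[int] = []
    for unit in units:
        genome.extend([UNIT_TYPES[unit.kind], unit.size, unit.repeat])
    return genome


def decode(genome: List[int]) -> List[Unit]:
    """Rebuild the ordered unit list; unit i implicitly feeds unit i + 1."""
    if len(genome) % 3 != 0:
        raise ValueError("genome length must be a multiple of 3")
    return [
        Unit(CODE_TO_TYPE[genome[i]], genome[i + 1], genome[i + 2])
        for i in range(0, len(genome), 3)
    ]


# A small CNN: two 3x3 convolutions, one 2x2 pooling layer, one 128-unit FC layer.
cnn = [Unit("conv", 3, 2), Unit("pool", 2), Unit("fc", 128)]
genome = encode(cnn)
```

No connection genes are needed here because the order of the triples fixes the topology, which is exactly what distinguishes this pattern from the nonfixed one.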
Nonfixed Topological Patterns: The solution representations following this pattern are determined by both architectural units and topological patterns. In other words, the connections between architectural units, e.g., the skip connection [58], [60], [83], [98], need to be encoded and optimized. The adjacency matrix [58], [83], [98] has been commonly used to represent the connections between architectural units, where a matrix element with the value of 1 denotes the existence of a connection between the two units indexed by the row and column indices of that matrix element. In most works, the adjacency matrix is encoded as a binary vector via serialization [59], [60]. One limitation of using the adjacency matrix is that the total number of architectural units needs to be fixed in advance so that the matrix can be formed. Therefore, it is not suitable for cases where the number of architectural units is a decision variable to be optimized. To address this issue, the adjacency list [61]-[63], [158] was proposed, which stores, for a specific unit, a list of the units connecting to it. It allows units to be added or removed easily while accounting for the connections to them. Fig. 3 illustrates the solution representation of a simple Inception-like CNN by using the adjacency matrix and the adjacency list, respectively, to represent the connections, where the representation of each unit follows the way described in the part on fixed topological patterns.
Fig. 3. Solution representation of a simple Inception-like CNN using the adjacency matrix and the adjacency list, respectively, to represent its connections (denoted in red). The adjacency matrix is encoded for the entire network as a binary vector via serialization, where 0 and 1 represent the nonexistence and existence of a connection. The adjacency list is encoded for each unit as a list of the indices of the units connecting to it.
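The two connection encodings can be sketched as follows. The sketch assumes a feed-forward network whose units are indexed in topological order, so that only the strict upper triangle of the adjacency matrix needs to be serialized; this restriction, like the function names, is an assumption made for illustration rather than a convention prescribed by the surveyed works.

```python
from typing import Dict, List


def matrix_to_bits(adj: List[List[int]]) -> List[int]:
    """Serialize the strict upper triangle of the matrix row by row."""
    n = len(adj)
    return [adj[i][j] for i in range(n) for j in range(i + 1, n)]


def bits_to_matrix(bits: List[int], n: int) -> List[List[int]]:
    """Inverse of matrix_to_bits; n must be known (fixed) in advance."""
    adj = [[0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = bits[k]
            k += 1
    return adj


def matrix_to_list(adj: List[List[int]]) -> Dict[int, List[int]]:
    """For each unit, list the indices of the units that feed into it."""
    n = len(adj)
    return {j: [i for i in range(n) if adj[i][j]] for j in range(n)}


def add_unit(adj_list: Dict[int, List[int]], inputs: List[int]) -> int:
    """Append a unit to an adjacency list without touching existing entries."""
    new_id = max(adj_list) + 1 if adj_list else 0
    adj_list[new_id] = list(inputs)
    return new_id


# Two parallel branches (units 1 and 2) that merge in unit 3.
adj = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
bits = matrix_to_bits(adj)
adj_list = matrix_to_list(adj)
new_unit = add_unit(adj_list, [3])
```

Growing the network by one unit only appends an entry to the adjacency list, whereas the binary vector would have to be re-serialized for a larger matrix, which mirrors the limitation discussed above.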
For both fixed and nonfixed topological patterns, the architectural units can be macrounits, which are more commonly known as cells in existing works. In cell-based solution representations [15], [53], [64]-[69], [71], [79], [159], the encodings of connections among cells follow the ways described in the parts on fixed and nonfixed topological patterns. The cell itself can be treated as a "small" DNN model, which can also be represented following either fixed or nonfixed topological patterns. This kind of solution representation may reduce the solution space by simply stacking the same cell in a certain way and optimizing only this cell, which has achieved very promising performance on CIFAR-10 and CIFAR-100 [67]. Fig. 4 illustrates the solution representation of a simple cell-based CNN.
In addition to solution representations in the vectorial form as described above, there are other forms of representations. For example, decision variables can be partitioned into multiple levels, and different levels are optimized in a hierarchical and cooperative way [13], [66], [86], [89], [125]. The tree-structure representations [75]-[77] are commonly used in model optimization methods based on GP [75], where the leaf node receives the input data and the other nodes perform some basic operations, such as convolution, max pooling, and summation.
3) Search Paradigms: Model architecture optimization requires evaluating the quality of candidate architectures, where learning model parameters is typically indispensable for evaluation. Therefore, both architectures and parameters usually need to be considered during the model architecture optimization process. Existing EA-based approaches in this regard often employ EAs for architecture optimization while using gradient-based methods for parameter optimization, and differ in how they search for the best architecture. In the following, we describe the search paradigms commonly employed in existing works that apply EAs to model architecture optimization.
1) Basic Search Paradigm: This paradigm, as illustrated in Fig. 5, follows the general framework of EAs in Algorithm 1 but incorporates a gradient-based parameter learning process before evaluation. Specifically, a population of candidate model architectures is initialized first. Then, model parameters for these architectures are learned via a gradient-based method on the training data, followed by evaluating the quality of the candidate architectures on the validation data. After that, reproduction is applied to some selected candidate architectures to generate new candidate architectures, which are trained using the gradient-based method and then evaluated. Next, a new population is formed by selecting elite candidate architectures from both the current population and the new candidate architectures generated from it. This new population is evolved by repeating the above steps until certain termination criteria are met. Many existing works [18], [74], [80]-[82] follow this search paradigm.
2) Incremental Search Paradigm: This paradigm is similar to the basic search paradigm in terms of the operational pipeline. The major difference is that it incrementally constructs the model by gradually adding components (i.e., different types of layers and connections) to it as the population evolves [13], [48], [160]. Fig. 6 illustrates an example of using this search paradigm to construct a simple CNN via a simple evolution strategy [160]. In this example, the initial population contains one candidate architecture with one convolutional layer. During reproduction, different adding operators, i.e., adding a convolutional layer, a pooling layer, or a skip connection, are applied to incrementally construct the architecture. Then, the newly generated candidate architectures are trained and evaluated. Next, the best of them are selected to form the new population, which is evolved by repeating the above steps until certain termination criteria are met. Compared to the basic search paradigm, the incremental one searches for only part of the model architecture in each search stage and thus reduces the computational cost [160]. However, it can only produce the complete architecture at the end of the search process, whereas the candidate architectures in the basic search paradigm are complete throughout the entire search process.
3) Cooperative Co-Evolution Paradigm: In this search paradigm, the model architecture is decomposed into subcomponents (i.e., macrounits) according to a "blueprint," which defines the topological patterns of the subcomponents. This decomposition is equivalent to the cell-based solution representation. Then, the blueprint and its associated subcomponents are optimized together in a certain way to search for the optimal model architecture. For example, in [86] and [87], the model architecture is decomposed into two parts, i.e., the blueprint and its subcomponents, which are co-evolved in a cooperative way to search for the optimal architecture. Specifically, a population of blueprints and a population of subcomponents are evolved separately. To evaluate the quality of an individual in either population, a collection of model architectures is generated by randomly sampling from both populations and assembling the sampled blueprints and subcomponents. Then, these model architectures are trained and evaluated. Next, the quality values of the evaluated model architectures that contain a given blueprint or subcomponent are averaged to estimate the quality of that blueprint or subcomponent. Fig. 7 illustrates this process.
Fig. 7. Example of the decomposition-based search paradigm, where a population of blueprints and a population of its subcomponents are co-evolved [86].
4) Hypernet-Based Search Paradigm: This search paradigm typically has two stages, i.e., the pretraining and searching stages [14], [83], [161], [162]. In the pretraining stage, a hypernet that subsumes all possible candidate model architectures is specified and trained, where gradient-based methods with various tricks, such as single path [83] and random search [162], have been proposed and used for training. The pretrained hypernet is then used to guide the subsequent searching stage, where any search paradigm described previously can be used to search for the optimal model architecture. For example, a population of candidate model architectures is sampled from the pretrained hypernet, where the model parameters of the sampled architectures are directly inherited from the hypernet. Then, these sampled architectures are evaluated, where the parameter learning process before evaluation is omitted. After that, reproduction is applied to some selected candidate architectures to generate new candidate architectures. The newly generated architectures may directly inherit their model parameters from the pretrained hypernet and then get evaluated. Next, a new population is formed by selecting elite candidate architectures from both the current population and the newly generated candidate architectures. This new population is evolved by repeating the above steps until certain termination criteria are met. Fig. 8 illustrates this example.
5) Multiobjective Optimization: In this search paradigm, multiple (often conflicting) objectives drive the search process, and evolutionary multiobjective optimization techniques [163] are often employed to search for a set of Pareto optimal solutions. Finally, one of these Pareto optimal solutions is chosen for deployment according to practical needs. This search paradigm can be combined with any of the previously discussed search paradigms. It is becoming more and more common in practice, where the optimization of model architecture is subject to factors other than accuracy, e.g., inference time and energy consumption [15], [49], [60], [70], [73], [85], [91]-[94], [96], [159]. For example, in [15], MOEAs are used to optimize model architectures by considering both model accuracy and a model size that can be hosted by the hardware on which the model will be deployed. In [70], both the accuracy and the inference time of the model are considered in the MOEA to optimize model architectures so that the obtained optimal architecture may satisfy the practical requirement on latency.
4) Customized Search Operators: For model architecture optimization, most of the search operators in EAs, e.g., initialization, evaluation, reproduction, and selection, can be readily applied to the solution representations described previously. For example, in [95], architectures with different layers are randomly generated as initial candidate solutions. In [95] and [52], recombination is applied to swap parts of the two selected models to generate two new architectures. In [13], mutation is applied to alter the configuration of a layer, e.g., the kernel size and stride. In [72], architectures with higher accuracy are selected into the population of the next generation. However, merely relying on the basic search operators in EAs may not be enough due to the nature of architecture search, e.g., varying-length solution representations, invalid architectures, and model parameter inheritance. In the following, the customized operators widely used in existing works on EA-based model architecture optimization are categorized and described.
1) Customized Recombination: The basic recombination operator assumes that the solution representations of two parents have an equal length. However, the EA's population used for model architecture optimization typically contains individuals of various lengths, denoting architectures of various sizes. Therefore, many existing works [52], [54]-[57], [80], [82], [95], [96] have designed customized recombination operators that can handle varying-length solution representations. For example, in [57] and [52], only the layers at the same positions in the two parents can be swapped. In [54], a cutting point is randomly selected in each of the two parents, and then the parts to the right of the two cutting points are swapped. Some quality control methods can be applied to the recombination operator to improve its effectiveness. For example, in [55], the cutting points are only allowed to be inserted at predefined positions to prevent invalid architectures from being generated.
2) Customized Mutation: The basic mutation operator is used to alter the values of selected decision variables within their feasible ranges, e.g., changing the kernel size. To enable the search for architectures of different sizes, many existing works [13], [48], [52], [55], [56], [67], [80], [97] have designed customized mutation operators that can change the size of the architecture by adding or removing its parts (encoded as decision variables). For example, the mutation operator was used in [13] to insert/remove convolution layers and skip connections into/from the parent architecture. In [55], the mutation operator was used to replicate a specific layer. Some quality control methods can be applied to the mutation operator to improve its effectiveness. For example, in [48] and [97], if a new architecture generated by the mutation operator performs worse than its parent(s), the addition or removal of components is revoked.
3) Repairment: When a new architecture is generated via recombination and/or mutation, it may be invalid and thus needs to be discarded or repaired. For example, when a new convolutional layer is added into an existing architecture, its input and/or output might be inconsistent with the output of the preceding layer and/or the input of the succeeding layer, resulting in an invalid new architecture [72]. The repairment operator is designed to modify an invalid architecture to make it valid. For the previous example, it can be used to adjust the input and/or output of the newly added convolutional layer [72]. Invalid architectures may have different kinds and degrees of invalidity depending on how they are generated, leading to various repairment operators. Notably, repairment operators allow more flexible search operators to be designed without having to strictly guarantee the validity of their outputs.
4) Inheritance: During the search, any newly generated architecture (child) typically retains part of its architecture from its parent(s). In addition, the model parameters in the retained part of its architecture may also be inherited from its parent(s). For example, when two selected parents exchange parts of their architectures during recombination, both the architecture parts and their associated model parameters are swapped [46], [97], [125]. When a selected parent undergoes mutation to generate a child, the child inherits the unmutated part of the architecture as well as its associated parameters from its parent [48]. Model parameter inheritance may avoid training newly generated models from scratch and thus reduce computational costs [125].
Among the above-discussed customized search operators, both recombination and mutation aim to generate new architectures based upon the old ones. In comparison, recombination typically leads to significant architectural changes from parents to children and thus suits the exploration of innovative architectures that may result in much improved performance [98]. Mutation typically leads to incremental changes to the old architecture and thus suits the exploitation of existing architectures. Furthermore, recombination depends more heavily on the encoding scheme and the repairment operator. Therefore, mutation is more widely used in existing works [13], [48], [67], [68].
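The interplay of varying-length recombination, size-changing mutation, and repairment can be sketched as follows. The sketch assumes a deliberately simplified representation in which an architecture is a list of convolutional layers, each reduced to an (input channels, output channels) pair; the crossover follows the idea of one independent cutting point per parent described for [54], while all names, the channel choices, and the repair rule are hypothetical.

```python
import random
from typing import List, Tuple

Layer = Tuple[int, int]  # (in_channels, out_channels); a simplification


def crossover(p1: List[Layer], p2: List[Layer], rng: random.Random):
    """One-point crossover with an independent cutting point in each parent."""
    c1 = rng.randint(1, len(p1))  # cut positions >= 1 keep both children nonempty
    c2 = rng.randint(1, len(p2))
    return p1[:c1] + p2[c2:], p2[:c2] + p1[c1:]


def add_layer_mutation(arch: List[Layer], rng: random.Random,
                       channel_choices=(16, 32, 64)) -> List[Layer]:
    """Insert a new layer at a random position; its input width is a placeholder."""
    pos = rng.randint(0, len(arch))
    out_ch = rng.choice(channel_choices)
    child = list(arch)
    child.insert(pos, (out_ch, out_ch))
    return child


def remove_layer_mutation(arch: List[Layer], rng: random.Random) -> List[Layer]:
    """Delete one random layer, keeping at least one layer in the architecture."""
    if len(arch) <= 1:
        return list(arch)
    pos = rng.randrange(len(arch))
    return arch[:pos] + arch[pos + 1:]


def is_valid(arch: List[Layer], input_channels: int) -> bool:
    """Valid if every layer's input width matches the preceding output width."""
    prev = input_channels
    for in_ch, out_ch in arch:
        if in_ch != prev:
            return False
        prev = out_ch
    return True


def repair(arch: List[Layer], input_channels: int) -> List[Layer]:
    """Rewrite each layer's input width to match the output of its predecessor."""
    fixed, prev = [], input_channels
    for _, out_ch in arch:
        fixed.append((prev, out_ch))
        prev = out_ch
    return fixed


rng = random.Random(0)
parent_a = [(3, 16), (16, 32), (32, 64)]
parent_b = [(3, 8), (8, 8)]
child_a, child_b = crossover(parent_a, parent_b, rng)
mutated = add_layer_mutation(parent_a, rng)
```

Because `repair` restores consistency afterwards, neither `crossover` nor `add_layer_mutation` has to guarantee valid offspring, which is the flexibility attributed to repairment operators above.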
RL-based approaches typically follow the incremental search paradigm, where the policy used for incremental architecture generation is designed and learned via feedback (rewards) to gradually search for the best architecture. Similar to EA-based ones, RL-based approaches also face a vast architecture search space and require time-consuming parameter learning, leading to demanding computational costs. Furthermore, they typically employ a certain policy network (DNN), which involves many hyperparameters and is not easy to train. Moreover, although both types of approaches can be applied to multiobjective scenarios, RL-based ones usually need to first convert the multiobjective problem into a single-objective one in some way [166], [167], and thus cannot produce the Pareto optimal set as EA-based approaches do.
In recent years, gradient-based approaches have been proposed, aiming to reduce computational costs, where model architecture optimization is formulated as a continuous optimization problem. For example, the DARTS approach proposed in [10] relaxes the discrete architecture search space to a continuous one, by mixing possible candidate operations, so that the architecture can be optimized efficiently via gradient descent. In comparison to EA-based ones, such approaches are much more efficient, but they are less competent at exploring novel architectures because all possible architectures that could be found need to be manually predefined via the solution representation.
EA-based approaches have been notorious for their high computational costs. For example, it may take about 3000 GPU days to find a desirable architecture [13]. In recent years, many computational speedup strategies have been proposed from the perspectives of algorithmic design and computing power, which will be discussed in Section VII. They allow EA-based approaches to achieve satisfactory performance at much reduced computational costs, e.g., several GPU days [15], [83]. In addition, EA-based approaches can easily and effectively deal with various constraints and multiple objectives, and their population-based nature inherently allows model ensembles to achieve better generalization [49], [101].

A. Problem Statement
Model architecture and parameter learner optimization seeks the most effective learners (intrinsically optimizers), aiming to best solve the model architecture optimization and model parameter optimization problems, respectively. Specifically, model architecture learner optimization searches for the best architecture learner (including its best-calibrated parameters, a.k.a. model hyperparameters) by optimizing, w.r.t. the architecture learner's representation $M_{la}$, the performance of the learner on solving the model architecture optimization problem defined in Section V. This is equivalent to the optimization problem defined in (1), which is prohibitively challenging. Model parameter learner optimization searches for the best parameter learner (including its best-calibrated parameters) by optimizing, w.r.t. the parameter learner's representation $M_{lp}$, the performance of the learner on solving the problem of model parameter optimization for a DNN model with a fixed architecture $M_a^{\dagger}$. It can be formulated as a bilevel optimization problem.
Most of the existing studies did not consider model architecture learner optimization due to its expensive computational cost. In the following, we will focus on model parameter learner optimization and provide a review of EA-based methods applied to solve it, and then discuss the pros and cons of EA-based methods in comparison to non-EA-based ones.

B. Taxonomy and Survey
In model parameter learner optimization, the learners used for model parameter optimization, e.g., gradient-based methods, are optimized in terms of both their types and parameters. The decision variables involved in such an optimization problem are often few and may take either discrete (for types) or continuous (for parameters) values. Also, there exists no explicit mathematical formulation of the objective function in terms of the decision variables. Therefore, EAs are a good choice for solving this problem, where direct encoding is often used for solution representations.
EAs have been applied to optimize the parameters of gradient-based methods [19], [20], [132]. For example, in [132], they are applied to optimize the learning rate, momentum, and batch size for the gradient-based approach Adam. EAs have also been used to both choose the most appropriate learner and search for the best parameters of that learner. For example, in [131], ES is applied to select the learner from Adam and Adadelta and optimize the parameters of the chosen learner. In some works, the optimization of the learner's parameters is integrated with the model architecture optimization process [51], [53], [55], [127]-[130], [133], [134], [168], [169]. For example, in [127], the EA is applied to design a VGG model, where the parameters of a prespecified model parameter learner are encoded together with the model architecture in the solution representation.
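A direct encoding of a parameter learner with mixed decision variables can be sketched as follows. The sketch uses a simple elitist (1+λ) scheme and a toy black-box fitness that merely stands in for "train the DNN with this learner and return its validation performance"; the search ranges, mutation settings, and the fitness itself are illustrative assumptions and do not reproduce the setups of [131] or [132].

```python
import math
import random

LEARNERS = ("adam", "adadelta")   # discrete variable: learner type
LR_RANGE = (1e-5, 1e-1)           # continuous variable: learning rate
BATCH_SIZES = (16, 32, 64, 128)   # discrete variable: batch size


def random_config(rng: random.Random) -> dict:
    """Sample a learner configuration; the learning rate is log-uniform."""
    return {
        "learner": rng.choice(LEARNERS),
        "lr": 10 ** rng.uniform(math.log10(LR_RANGE[0]), math.log10(LR_RANGE[1])),
        "batch": rng.choice(BATCH_SIZES),
    }


def mutate(cfg: dict, rng: random.Random, p_discrete=0.2, sigma=0.3) -> dict:
    """Resample discrete genes with small probability; perturb lr multiplicatively."""
    child = dict(cfg)
    if rng.random() < p_discrete:
        child["learner"] = rng.choice(LEARNERS)
    if rng.random() < p_discrete:
        child["batch"] = rng.choice(BATCH_SIZES)
    lr = cfg["lr"] * math.exp(rng.gauss(0.0, sigma))
    child["lr"] = min(max(lr, LR_RANGE[0]), LR_RANGE[1])  # clamp to the search range
    return child


def toy_fitness(cfg: dict) -> float:
    """Hypothetical stand-in for training a DNN and measuring validation accuracy."""
    score = -abs(math.log10(cfg["lr"]) + 3.0)  # arbitrary peak at lr = 1e-3
    score += 0.1 if cfg["learner"] == "adam" else 0.0
    return score


def evolve(fitness, generations=30, offspring=8, seed=0):
    """Elitist (1 + lambda) search over learner configurations."""
    rng = random.Random(seed)
    best = random_config(rng)
    best_fit = fitness(best)
    for _ in range(generations):
        children = [mutate(best, rng) for _ in range(offspring)]
        challenger = max(children, key=fitness)
        challenger_fit = fitness(challenger)
        if challenger_fit >= best_fit:
            best, best_fit = challenger, challenger_fit
    return best, best_fit


best_cfg, best_score = evolve(toy_fitness)
```

The search only ever calls `fitness` as a black box, which is why no explicit formulation of the objective is required; in practice each call would involve a full or partial training run.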

C. Discussion
In existing works, the model architecture learner is typically prespecified instead of being optimized, because solving the model architecture learner optimization problem incurs a practically prohibitive computational cost, even higher than that of NAS. As for model parameter learner optimization, it is a black-box, mixed-variable optimization problem as discussed previously. Bayesian optimization (BO) and EAs, as the two most representative derivative-free optimization methods, have been applied to solve it. Among them, BO does not rely on heavy trial-and-error exploration but is less effective at handling mixed decision variables and constraints. In contrast, EAs are more suitable for solving mixed-variable optimization problems with constraints [131], [168]. Although EAs are very time consuming, many computational speedup strategies have been proposed in recent years from the perspectives of both algorithmic design and computing power, which will be discussed in Section VII. They allow EA-based approaches to achieve satisfactory performance at much reduced computational costs [19], [20], [134].

VII. EVOLUTIONARY DNN CONSTRUCTION: MISCELLANEOUS
In addition to the three major optimization problems discussed in the previous sections, this section describes how EAs are applied to solve other optimization tasks involved in the automatic DNN construction process. Furthermore, existing works on two key factors relevant to optimization, i.e., speedup and objectives, are summarized.

A. Other Optimization Tasks
Besides model parameter, architecture, and learner optimization, evolutionary DNN construction involves some other optimization tasks. For example, in [170], the EA is used in the data preprocessing step to select more useful features from the input images to improve the performance of the DNN on edge detection. Furthermore, a DNN may employ different loss functions, leading to different performance. The loss function used by a DNN is typically designed or specified according to human expertise instead of in a problem-driven way. To address this issue, EAs have been applied to optimize the DNN's loss function [171]-[173], [206], [208]. For example, in [171], the EA is applied to optimize the misclassification cost to improve the DNN's performance on imbalanced classification problems. In [173], a specific loss function is designed by the EA to speed up model parameter optimization, achieving higher accuracy with fewer training steps. In [208], the EA is applied to optimize the loss function to make policy learning in RL suitable for dynamic environments.

B. Optimization Speed-Up
Most optimization problems involved in automated DNN construction are computationally demanding, mainly due to the vast and complex search spaces for DNN architecture and parameters. Existing works aiming at optimization speedup can be categorized from the aspects of software and hardware, which are described below in terms of algorithm design and computation power, respectively.
1) Algorithm Design: From the viewpoint of algorithm design, the commonly used speedup techniques in the literature are as follows.
1) Parameter Sharing: This kind of technique is often used in transfer learning [64], [174]-[176], which uses the knowledge (i.e., model parameters) learned from solving one task to help solve another task. For EA-based approaches, through the inheritance operator described in Section V, the offspring model can inherit model parameters from its parents. The inherited model parameters can be used as a warm start to avoid training the offspring model from scratch [13], [125], [177]. Besides the inheritance operator, a more general parameter-sharing technique is the hypernet framework [14], [15], [83], [161], [162]. Specifically, a hypernet, which subsumes all the candidate model architectures in the search space, is trained first. Then, subnetworks sampled from the hypernet inherit its model parameters, and these models can be evaluated directly. As a result, the time cost is reduced significantly.
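Both forms of parameter sharing can be sketched with plain Python containers. In the sketch, a "weight tensor" is reduced to a short list of floats, the hypernet is a dictionary keyed by (layer index, operation), and the candidate operation names are hypothetical; no training is performed, so the fragment only illustrates where inherited parameters come from.

```python
import random
from typing import Dict, List, Tuple

OPS = ("conv3x3", "conv5x5", "skip")  # hypothetical candidate operations per layer
Weights = List[float]


def build_hypernet(num_layers: int, rng: random.Random) -> Dict[Tuple[int, str], Weights]:
    """Hold one weight tensor for every (layer, candidate operation) pair."""
    return {
        (layer, op): [rng.gauss(0.0, 1.0) for _ in range(4)]
        for layer in range(num_layers)
        for op in OPS
    }


def inherit_from_hypernet(hypernet, arch: List[str]) -> List[Weights]:
    """Look up the subnetwork's weights instead of training them; tensors are shared."""
    return [hypernet[(layer, op)] for layer, op in enumerate(arch)]


def inherit_from_parent(parent_arch: List[str], parent_weights: List[Weights],
                        child_arch: List[str], rng: random.Random) -> List[Weights]:
    """Warm start: copy weights of unchanged layers, freshly initialize the rest."""
    child_weights = []
    for layer, op in enumerate(child_arch):
        if layer < len(parent_arch) and parent_arch[layer] == op:
            child_weights.append(list(parent_weights[layer]))
        else:
            child_weights.append([rng.gauss(0.0, 1.0) for _ in range(4)])
    return child_weights


rng = random.Random(0)
hypernet = build_hypernet(num_layers=3, rng=rng)
parent_arch = ["conv3x3", "skip", "conv5x5"]
parent_weights = inherit_from_hypernet(hypernet, parent_arch)

# A child that differs from its parent only in the middle layer.
child_arch = ["conv3x3", "conv5x5", "conv5x5"]
child_weights = inherit_from_parent(parent_arch, parent_weights, child_arch, rng)
```

In the hypernet case the lookup returns the shared tensors themselves, so a sampled subnetwork can be evaluated immediately, whereas parent-to-child inheritance copies weights and leaves the changed layers to be trained.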

2) Training Cost Reduction:
Early stopping is a widely used strategy to reduce the training cost. For example, in [88], the number of training epochs is set to 5 for each candidate model to reduce training time.
Another commonly used strategy is to first perform model optimization on a small dataset at a small cost, and then refine the obtained model on a large dataset. This can save considerable computational cost compared to directly performing model optimization on the large dataset. For example, in [10] and [67], the model architecture is first optimized on CIFAR-10, and then the obtained best architecture is applied, with possible further refinement, to ImageNet. This kind of speedup technique may sacrifice performance to some degree because the model optimized in this way is often suboptimal.
3) Performance Prediction: Predicting the performance of a DNN may help speed up model architecture and/or parameter optimization. For example, different machine learning models [102], [160], [178] have been used to predict the performance of a model under training to determine whether the model is still worth training further [103]. In recent years, proxy models have been proposed to learn a mapping from model architecture to model performance [126], [179], [180]. As such, the huge computational burden of parameter learning for a specific model architecture can be avoided to reduce the computational cost in NAS.
4) Cell-Based Framework: Many existing works search for the entire DNN at once, which is often computationally demanding due to the very large size of the DNN. The cell-based framework is used to speed up the construction of DNNs. In this framework, a DNN is assumed to be composed, in a certain pattern, of multiple small components with the same architecture, the so-called cells [64], [104]. As a result, the automated construction of an entire DNN is transformed into the optimization of a cell in a much reduced search space, leading to significant computational speedup.
2) Computation Power: Taking advantage of the available computational resources is also an effective way to speed up optimization. A straightforward method is to use free or cheap resources, such as the cloud [181] and volunteer computers [182]. However, these resources are often limited, and access to them is further constrained by the need to protect private information.
Parallel computation is an effective way to make the most of the computational resources [20], [105], [183]. EAs, as population-based methods, are naturally suited to parallel frameworks. Recently, in [20], an asynchronous parallel framework was proposed that consists of two parts, i.e., the controller and the workers. The controller is responsible for applying search operators to produce offspring and for updating the population. The workers train and evaluate these individuals, which requires a considerable amount of compute resources. When a worker finishes the training and evaluation of one individual, it sends back the fitness and receives a new model from the controller. Meanwhile, the controller updates the global information and the population according to the information it has received. Compared with a synchronous parallel process, in the asynchronous parallel framework, the devices can continually process individuals rather than keep pace with each other, improving the utilization of limited compute resources.
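The controller/worker division can be sketched with Python's standard `concurrent.futures` module. The sketch is not the framework of [20]: the worker job is a toy function that stands in for training and evaluating a DNN, the individuals are short real-valued vectors, and the selection and mutation rules are arbitrary assumptions. What it does illustrate is the asynchronous pattern, in which the controller hands out a new individual as soon as any single worker returns.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def train_and_evaluate(individual):
    """Worker job; a placeholder for training a DNN and returning its fitness."""
    fitness = -sum((x - 0.5) ** 2 for x in individual)
    return individual, fitness


def make_offspring(population, rng):
    """Controller-side reproduction: binary tournament plus Gaussian mutation."""
    a, b = rng.sample(population, 2)
    parent = a if a[1] >= b[1] else b
    return [min(1.0, max(0.0, x + rng.gauss(0.0, 0.1))) for x in parent[0]]


def async_evolve(num_workers=4, pop_size=8, budget=40, seed=0):
    rng = random.Random(seed)
    population = []  # list of (individual, fitness), best first
    submitted = 0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        pending = set()
        # Keep every worker busy with a random individual at the start.
        while submitted < min(num_workers, budget):
            individual = [rng.random() for _ in range(3)]
            pending.add(pool.submit(train_and_evaluate, individual))
            submitted += 1
        while pending:
            # Wake up as soon as ANY worker finishes; do not wait for the others.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                population.append(future.result())
                population.sort(key=lambda entry: entry[1], reverse=True)
                del population[pop_size:]  # truncate to the best pop_size
                if submitted < budget:
                    if len(population) >= 2:
                        child = make_offspring(population, rng)
                    else:
                        child = [rng.random() for _ in range(3)]
                    pending.add(pool.submit(train_and_evaluate, child))
                    submitted += 1
    return population


final_population = async_evolve()
```

A synchronous variant would instead wait for all evaluations of a generation before reproducing, so a single slow training run would leave the remaining workers idle.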

C. Optimization Objectives
Accuracy is the most intuitive optimization objective for the evolutionary construction of DNNs. In addition, there are other objectives that need to be considered in practical applications. This section discusses these other objectives studied in the literature.
1) Inference Time: It measures the time required for the input to propagate forward through the network to produce the output, which is a crucial factor in real-time applications such as velocity prediction in autopilot [184]. This measurement depends on the model architecture and the computing device on which the model is deployed [70], [97].
2) Computational Complexity: It estimates the computing speed of an algorithm in terms of the number of floating-point operations (FLOPs) [70], [94] or multiply-add operations [73], [159].
3) Space Complexity: It measures the amount of working storage required by an algorithm, and can be roughly estimated via the number of model parameters [15], [94], the number of connections, or the sparsity of the model [92]. This measurement is related to the overfitting and underfitting of the model.
4) Energy Consumption: It measures the average energy consumed by the model to make an inference on the input data, which can be roughly estimated via the peak power, average power, and the running time of CUDA kernel functions [167], [184]. This measurement is vital for scenarios where DNNs are constructed for devices with limited energy capacity, such as mobile phones.
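When accuracy is traded off against one of these objectives, candidate models are compared by Pareto dominance rather than by a single score. The following sketch filters the nondominated models for two minimized objectives, classification error and inference time, and then picks one model under a latency budget; the five candidate models and their numbers are made up purely for illustration.

```python
def dominates(a, b):
    """a dominates b if it is no worse in all objectives and better in at least one.

    All objectives are assumed to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def objectives(model):
    return (model["error"], model["latency_ms"])


def pareto_front(models):
    """Return the models that are not dominated by any other model."""
    return [
        m for m in models
        if not any(dominates(objectives(o), objectives(m)) for o in models if o is not m)
    ]


def pick_for_deployment(front, max_latency_ms):
    """Choose the most accurate Pareto optimal model that meets the latency budget."""
    feasible = [m for m in front if m["latency_ms"] <= max_latency_ms]
    return min(feasible, key=lambda m: m["error"]) if feasible else None


# Hypothetical evaluation results for five candidate architectures.
models = [
    {"name": "A", "error": 0.05, "latency_ms": 30},
    {"name": "B", "error": 0.08, "latency_ms": 12},
    {"name": "C", "error": 0.09, "latency_ms": 20},  # dominated by B
    {"name": "D", "error": 0.04, "latency_ms": 55},
    {"name": "E", "error": 0.05, "latency_ms": 35},  # dominated by A
]
front = pareto_front(models)
chosen = pick_for_deployment(front, max_latency_ms=40)
```

The final choice is deferred until the deployment constraint is known, which is the practical advantage of returning a Pareto optimal set instead of a single model.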

A. Applications
Evolutionary DNN construction approaches, as elaborated in Sections IV-VII, have demonstrated successes in various applications. In the following, we summarize these applications from three dimensions, i.e., data types, application fields, and deployment scenarios.
Application Fields: Automated DNN construction opens the door for researchers and engineers from different fields, with little expertise and experience in DNNs, to better utilize DNNs to resolve the problems arising in their fields. For example, in biomedical engineering, EAs have been applied to design DNNs for disease diagnosis [81], [116]-[119], protein structure prediction [186], and sleep study [187]. In mechanical engineering, EAs have been used to design DNNs for failure detection [114], [115], robot control [188], and remaining useful life prediction [49]. In addition, evolutionary DNN construction has been applied to gamma-ray detection [110], traffic flow prediction [189], and electricity price forecasting [79].
Deployment Scenarios: With the advancement of IoT and 5G, more and more devices, e.g., smartphones, vehicles, and drones, have benefited from various DNN-powered applications. The DNNs deployed on such devices typically need to carefully consider model size and complexity to meet real-world constraints on computational latency, energy consumption, etc. This poses extra challenges to evolutionary DNN construction approaches, e.g., solving a multiobjective, instead of a single-objective, optimization problem. For example, EAs have been applied to model compression so that the compressed model can work well on the device without suffering from computational resource and memory storage issues [73], [184], [190], [207]. Also, MOEAs have been applied to design the DNN by considering multiple conflicting objectives (e.g., model size and accuracy) at the same time, and then users can select, from the finally obtained Pareto optimal set, the most suitable model according to the hardware environment in practice [70], [97], [107].

B. Challenges
Evolutionary DNN construction has achieved great success but still faces unsolved challenges. In the following, some key challenges are summarized and discussed.
Tradeoff Between Optimization Cost and Model Performance: Evolutionary DNN construction approaches typically have high computational costs. In recent years, many speedup strategies have been proposed from the perspectives of both algorithmic design and computing power, as discussed in Section VII. However, they may have side effects on model performance. For example, when the search space is purposely limited, the chance of finding novel architectures becomes small. The one-shot design paradigm cannot avoid unreliable model ranking, which may lead to undesirable design outcomes. Performance prediction approaches may wrongly discard promising candidate models because of the inaccuracy of the predictive models. As such, making a good tradeoff between optimization cost and model performance remains a challenge to be further studied.
Effectiveness of Model Architecture Optimization Methods: Recent studies revealed that many mainstream optimization methods do not differ much from random search in how well they solve the model architecture optimization problem [191]-[194]. On some tasks, random search even outperforms the other methods. Therefore, it becomes necessary to comprehensively and systematically evaluate and compare existing model architecture optimization methods, aiming to reveal their suitable and unsuitable application scenarios. This endeavor is more important than continuing to propose new but less understood methods.
More Challenging Tasks: In existing works, many DNN models constructed by EAs are built on small datasets, e.g., MNIST and CIFAR-10, because of the low computational cost. These constructed models may achieve nearly perfect accuracy, e.g., 99% on MNIST and 97% on CIFAR-10, because these tasks are relatively easy. As a result, further improvement becomes technically more difficult but practically less useful. On the other hand, when the same approach is applied to construct the DNN on large-scale datasets, such as ImageNet and CIFAR-100, its computational cost may become prohibitively high. Furthermore, there is no guarantee that the effectiveness of an approach demonstrated on a small-scale dataset can be retained on a large-scale dataset. As a result, datasets of small size but high difficulty are in demand. Also, the sensitivity of an approach's performance to the size of the dataset used should be investigated.
Model Architecture Learner: The effectiveness of model architecture optimization depends on the employed optimization method, a.k.a. the model architecture learner. To obtain the best performance in model architecture optimization, it is intuitive to look for the best model architecture learner, e.g., by finding the most suitable population size and maximum generation number for EA-based learners. However, as discussed in Section VI, this task corresponds to a multilevel optimization problem whose computational cost is prohibitive, far higher than that of model architecture optimization per se. This major challenge calls for further investigation.

C. Trends
In this part, some popular research trends in evolutionary DNN construction are discussed.
Benchmark Platform for Model Comparison and Development: Existing works usually involve comparison of different evolutionary DNN construction methods. However, the compared methods often have distinct search spaces, and thus their intrinsic capabilities for DNN construction differ, i.e., the best model obtained by one method may lie outside the search space of another method. To address this issue, a dedicated benchmark for NAS, named NAS-Bench-101, was proposed in [195], where all possible DNN architectures within a prespecified search space for a specific task, i.e., image classification on CIFAR-10, are fully trained in advance. When methods are compared on this benchmark, they employ the same search space, which guarantees fairness of comparison. Furthermore, there is no need to invoke the time-consuming parameter learning process during architecture optimization because all candidate architectures have already been trained. Therefore, this kind of benchmark removes the computational bottleneck and allows researchers to focus on designing and developing more effective optimization methods. However, one benchmark merely corresponds to a certain search space and a certain task, which cannot cover problems at various difficulty levels. To deal with this issue, a growing number of benchmarks covering different search spaces and tasks have been proposed, e.g., NAS-Bench-201 [196], NAS-Bench-301 [197], TransNAS-Bench-101 [198], and HW-NAS-Bench [199].
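The way a tabular benchmark decouples search from training can be illustrated with a toy sketch. It does not use the API of any of the benchmarks named above: the class `TabularBenchmark`, the operation set `OPS`, and the stored accuracies (random placeholders) are all hypothetical. The sketch shows only that two optimizers share one search space and one query budget, and that each evaluation is a table lookup rather than a training run.

```python
import itertools
import random

OPS = ("conv3x3", "conv1x1", "maxpool")  # hypothetical operation choices


def build_toy_table(seed=0):
    """Stand-in for precomputed results: every architecture maps to a stored
    accuracy. The accuracies are random placeholders, not real measurements."""
    rng = random.Random(seed)
    return {arch: round(rng.uniform(0.80, 0.95), 4)
            for arch in itertools.product(OPS, repeat=4)}


class TabularBenchmark:
    """Answers accuracy queries by lookup and counts how many were made."""

    def __init__(self, table):
        self.table = table
        self.num_queries = 0

    def query(self, arch):
        self.num_queries += 1
        return self.table[arch]


def random_search(bench, budget, rng):
    archs = list(bench.table)
    return max(bench.query(rng.choice(archs)) for _ in range(budget))


def mutate(arch, rng):
    """Replace the operation at one random position with a different one."""
    i = rng.randrange(len(arch))
    new_op = rng.choice([op for op in OPS if op != arch[i]])
    return arch[:i] + (new_op,) + arch[i + 1:]


def simple_evolution(bench, budget, rng, pop_size=5):
    """Mutation-only EA: binary tournament parent selection, worst is removed."""
    archs = list(bench.table)
    pop = [(bench.query(a), a) for a in rng.sample(archs, pop_size)]
    for _ in range(budget - pop_size):
        parent = max(rng.sample(pop, 2))[1]
        child = mutate(parent, rng)
        pop.append((bench.query(child), child))
        pop.remove(min(pop))
    return max(pop)[0]


table = build_toy_table()
for name, method in (("random search", random_search),
                     ("simple evolution", simple_evolution)):
    bench = TabularBenchmark(table)
    best = method(bench, budget=30, rng=random.Random(1))
    print(name, best, bench.num_queries)
```

Because both methods draw from the same table and are charged the same number of queries, any difference in the reported best accuracy reflects the optimizer rather than the search space or the training procedure.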
Design of Architecture Search Space: The intrinsic capability of an evolutionary DNN construction method heavily depends on the search space, which is often manually specified. Recently, there has been growing interest in automatically designing the search space of model architectures [200]-[202]. On the one hand, a simpler search space may facilitate the optimization process carried out in it. For example, in [201], fitting a linear function between decision variables reduces the size of the search space, leading to better search results. On the other hand, a more powerful search space that contains more effective architectures may inherently boost the effectiveness of the optimization process. For example, in [202], EAs are applied to design the search space automatically. In that work, the search space itself is formulated as the candidate solution evolved by the EA, and the quality of a specific search space is estimated via the average quality of DNNs sampled from it. The best search space eventually found by the EA is expected to allow any optimization method operating in it to produce high-quality models.
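The idea of scoring a search space by the average quality of architectures sampled from it can be sketched as follows. This is an illustration only and not the procedure of [202]: the decision variables `depth` and `width`, the function `toy_quality` (a cheap stand-in for the trained accuracy of a DNN), and the two example spaces are hypothetical.

```python
import random
from statistics import mean


def sample_architecture(space, rng):
    """Draw one architecture by picking a value for every decision variable."""
    return {name: rng.choice(choices) for name, choices in space.items()}


def toy_quality(arch):
    """Hypothetical stand-in for the (expensive) trained accuracy of one DNN."""
    return 1.0 - abs(arch["depth"] - 8) / 16 - abs(arch["width"] - 64) / 256


def space_quality(space, n_samples=200, seed=0):
    """Fitness of a search space: mean quality of architectures sampled from it."""
    rng = random.Random(seed)
    return mean(toy_quality(sample_architecture(space, rng))
                for _ in range(n_samples))


broad = {"depth": [2, 4, 8, 16], "width": [16, 32, 64, 128]}
narrow = {"depth": [4, 8], "width": [32, 64]}
print(space_quality(broad), space_quality(narrow))
```

In an EA over search spaces, `space_quality` would play the role of the fitness function, and variation operators would add or remove choices from each variable. In this toy setting the narrower space scores higher because it excludes the poorly performing extremes.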
Handling Insufficient Annotated Data: Evolutionary DNN construction typically requires a fairly large amount of annotated data to enable the optimization process, e.g., the training set for model parameter optimization and the validation set for model architecture optimization. However, in practice, the amount of available annotated data is often limited, which poses challenges to many existing approaches. Furthermore, unannotated data are usually available but not well utilized. Recent years have seen many works addressing these issues, e.g., the metalearning [203], unsupervised [204], and self-supervised [205] NAS techniques, which have demonstrated very promising results and deserve further investigation.

IX. CONCLUSION
In this work, we formulated automated DNN construction as a multilevel multiobjective optimization problem with constraints, analyzed this problem to gain deep insights, and provided a comprehensive review of EA-based approaches to solving it, mainly from the aspects of model parameter optimization, model architecture optimization, and model learner optimization. Furthermore, we discussed the pros and cons of EA-based approaches in comparison with other commonly used approaches in different optimization scenarios, as well as two essential factors in optimization, i.e., computational speedup and optimization objectives. Moreover, we summarized the applications, challenges, and trends in this area of study. As discussed in Section I, this work differs from existing surveys. It aims to help DNN researchers better understand why, where, and how to use EAs for automated DNN construction, and to help EA researchers better understand the task of automated DNN construction so that they may focus more on EA-favored optimization scenarios to devise more effective techniques. Finally, we summarized the publicly available datasets and code used in relevant studies and provided them in the online supplementary document.

Fig. 2. Solution representation of a simple CNN using the fixed topological pattern, where the layers from the front to the end of the CNN are each encoded via the type (C: convolutional, P: pooling, and F: fully connected) and parameters (the kernel size and # of neurons in a layer) of the layer and concatenated in order to form a vectorial representation.
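A minimal sketch of such a vectorial representation is given below. It assumes one parameter per layer (kernel size for C and P layers, number of neurons for F layers) and an arbitrary integer code for each layer type; both choices are illustrative assumptions and are not specified by the figure.

```python
# Integer codes for the layer types are an assumption made for this sketch.
TYPE_CODES = {"C": 0, "P": 1, "F": 2}
CODE_TYPES = {code: name for name, code in TYPE_CODES.items()}


def encode(layers):
    """Flatten [(type, parameter), ...] into [code, parameter, code, parameter, ...]."""
    vec = []
    for layer_type, param in layers:
        vec.extend([TYPE_CODES[layer_type], param])
    return vec


def decode(vec):
    """Inverse of encode: read the vector two entries at a time."""
    return [(CODE_TYPES[vec[i]], vec[i + 1]) for i in range(0, len(vec), 2)]


# C(3x3) -> P(2x2) -> C(5x5) -> P(2x2) -> F(128 neurons)
cnn = [("C", 3), ("P", 2), ("C", 5), ("P", 2), ("F", 128)]
genome = encode(cnn)
print(genome)  # [0, 3, 1, 2, 0, 5, 1, 2, 2, 128]
```

Because the layer order is fixed by position in the vector, standard vector-based variation operators can act on `genome` directly, provided that mutated entries are kept within the valid range for their position.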

Fig. 4. Solution representation of a cell-based CNN, which is formed by stacking m types of cells, with each cell, denoted by Cell_i (i = 1, . . ., m), repeated N_i times, where the connections in the cell are encoded via the adjacency matrix-based representation.
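The cell-based representation can be sketched as follows. The sketch assumes that the nodes of a cell are topologically ordered, so that a valid cell corresponds to a strictly upper-triangular 0/1 adjacency matrix; this ordering, the two example cells, and the helper names are assumptions made for illustration. Cell indices are 0-based in the code, whereas the caption counts from 1.

```python
def is_valid_cell(adj):
    """Square matrix with no self-loops or backward edges (i.e., a DAG)."""
    n = len(adj)
    return (all(len(row) == n for row in adj)
            and all(adj[i][j] == 0 for i in range(n) for j in range(i + 1)))


def cell_edges(adj):
    """List the directed connections (source node, target node) in a cell."""
    return [(i, j) for i, row in enumerate(adj) for j, v in enumerate(row) if v]


def stack_cells(cells, repeats):
    """Order of cells in the whole network: cell i appears repeats[i] times."""
    assert len(cells) == len(repeats)
    return [i for i, n in enumerate(repeats) for _ in range(n)]


# Two hypothetical cell types (m = 2).
cell_a = [[0, 1, 1, 0],
          [0, 0, 0, 1],
          [0, 0, 0, 1],
          [0, 0, 0, 0]]
cell_b = [[0, 1, 0],
          [0, 0, 1],
          [0, 0, 0]]

cells = [cell_a, cell_b]
layout = stack_cells(cells, repeats=[2, 1])
print(cell_edges(cell_a))  # [(0, 1), (0, 2), (1, 3), (2, 3)]
print(layout)              # [0, 0, 1]
```

An EA then only needs to evolve the entries above the diagonal of each matrix (and, optionally, the repeat counts), which keeps the genome small regardless of the depth of the stacked network.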