Estimation of Distribution Algorithms in Machine Learning: A Survey

—The automatic induction of machine learning models capable of addressing supervised learning, feature selection, clustering and reinforcement learning problems requires sophisticated intelligent search procedures. These searches are usually performed in the possible model structure spaces, leading to combinatorial optimization problems, and in the parameter spaces, where it is necessary to solve continuous optimization problems. This paper reviews how the estimation of distribution algorithms, a kind of evolutionary algorithm, can be used to address these problems. Topics include preprocessing, mining association rules, selecting variables, searching for the optimal supervised learning model (both probabilistic and nonprobabilistic models), ﬁnding the best hierarchical, partitional or probabilistic clustering, obtaining the optimal policy in reinforcement learning and performing inference and structural learning in Bayesian networks for association discovery. Interesting guidelines for future work in this area are also provided.


I. INTRODUCTION
Currently, machine learning is the branch of artificial intelligence that is receiving the most investment and development, at the methodological level and in terms of innovation and application.This is due to the increase in accessibility to databases in diverse domains such as medicine, energy, industry and smart cities.In all these domains, the modelling process that must be carried out by the appropriate machine learning paradigm -supervised learning, clustering or reinforcement learning-usually consists of searching for the model (including its structure and parameters) that best fits the data or yields the best performance.This model search is usually carried out in spaces with large cardinalities, i.e., exponential or superexponential in the number of variables.
Examples of machine learning problems where these large spaces arise are as follows: (i) feature subset selection, where the number of possible feature subsets f (n), n being the number of variables, is given by [1]: f (n) = 2 n ; (ii) partitional clustering, where the number of possible partitional clustering assignments S(N, K) of N objects into K groups is given by [2]: S(N, K) = 1 f (n) = n i=1 (−1) i+1 n i 2 i(n−i) f (n − i), for n > 2, which is initialized with f (0) = f (1) = 1; and (iv) permutation problems, as in the case of the optimal triangulation of the moral graph associated with a Bayesian network [4] with n nodes, whose cardinality space is n!.In all these examples, searches are performed in discrete spaces and correspond with discrete highly complex optimization problems.
Moreover, finding the best parameters of a machine learning model is usually associated with continuous optimization problems.Examples are the solutions of the likelihood equations in a logistic regression model [5], for which there is no closed analytical expression, or the parameters of an artificial neural network that are usually searched by means of a backpropagation algorithm [6], which often becomes stuck in local optima.
The examples described above belong to the category of NPhard problems, justifying the use of heuristics in the search for optimal solutions.Classical optimization methods cannot solve those problems on any modern computer within reasonable time.The optimization heuristics that have been used in the literature to search for the best machine learning model and its parameters range from deterministic heuristics, such as the sequential forward (backward) feature selection (elimination), greedy hill climbing, best-first, floating search, tabu search and branch and bound algorithms, to nondeterministic heuristics with single solutions such as the simulated annealing, greedy randomized adaptive search procedure and variable neighbourhood search algorithms, and with population-based metaheuristics, such as the scatter search, and evolutionary algorithms (genetic algorithms [7], [8], ant colony optimization [9], particle swarm optimization [10], estimation of distribution algorithms [11], differential evolution [12], evolutionary programming [13], genetic programming [14] and evolution strategies [15]).
Estimation of distribution algorithms evolve individuals in the population in a special manner as opposed to other metaheuristics.This is why these algorithms will be the focus of this survey.Their advantages follow [11].First, the evolutionary process is stochastic and based on (solid) probability theory, allowing to scape from local optima.Second, it is not necessary to design ad hoc operators for each problem, which is usually required in other evolutionary computation methods.Third, the interactions (in terms of conditional independence) between the variables that encode each individual can be seen explicitly in a probabilistic graphical model learned from the selected population at each generation.This enables the search process interpretation.Fourth, it is possible to This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/incorporate expert knowledge of the optimization problem in that graphical model, e.g., by forcing some must-link (or must-not-link) variables as part of the graph.Fifth, the study of the convergence results of different variants of estimation of distribution algorithms is facilitated by the mathematical ductility of some formulations and probability foundations [16].The last three advantages are only present in estimation of distribution algorithms, unlike any other metaheuristics.
Evolutionary machine learning refers to the use of evolutionary algorithms for solving machine learning problems.A number of surveys and review papers have been published on this topic and focus on evolutionary algorithms designed for particular machine learning tasks, such as feature subset selection [17], association rule mining [18], [19], classification trees [20], deep learning [21], [22], [23], clustering [24], [25], [26] and association discovery with Bayesian networks (including estimation of distribution algorithms) [27].Other surveys have covered evolutionary algorithms applied to a number of machine learning tasks [28], [29], [30].As regards [28], it is a brief overview of evolutionary computation in classification, regression and clustering.Instead, [29] and [30] are devoted to multiobjective evolutionary algorithms for machine learning.
This paper surveys the use of estimation of distribution algorithms in machine learning: (i) preprocessing tasks such as the optimal rearrangement of rows and columns of tables and multivariate discretization; (ii) association rule mining; (iii) supervised learning tasks, such as feature subset selection, and knearest neighbours modelling, classification trees, rule induction, artificial neural networks, logistic regression, Bayesian classifiers, metaclassifiers and regression; (iv) clustering methods, such as hierarchical clustering, partitional clustering, and probabilistic clustering; (v) reinforcement learning; and (vi) inference and structure learning in Bayesian networks for association discovery.Real applications solved by machine learning methods based on estimation of distribution algorithms are also shown.In addition, the use of estimation of distribution algorithms is justified by the interpretability of the probabilistic graphical models learned in each generation of the algorithm, which will also be emphasized.To the best of our knowledge, no survey exists on the estimation of distribution algorithms for any machine learning task.This is the main motivation behind our work, which covers all the above-mentioned tasks.
The rest of this article is organized as follows.Section II provides background information on the estimation of distribution algorithms for discrete and continuous optimization domains with a brief introduction of Bayesian networks.This is necessary for understanding the estimation of distribution algorithms with multivariate dependencies.Estimation of distribution algorithms for multiobjective optimization are also presented.From Section III to Section VIII, approaches based on these algorithms in different machine learning tasks are reviewed.In particular, preprocessing, association rule mining, supervised learning, clustering, reinforcement learning and association discovery with Bayesian networks are discussed in Sections III, IV, V, VI, VII and VIII, respectively.Estimation of distribution algorithms applied in real machine learning problems are found in Section IX.Finally, Section X presents conclusions and future work.

A. Evolutionary Algorithms
Evolutionary algorithms [31] comprise a set of heuristic techniques that aim to solve (combinatorial or continuous) optimization problems by computationally reproducing the principles of natural evolution proposed by Darwin in 1859 [32].The search for the optimal solution is carried out by evolving a population of individuals, each of which represents a possible solution to the optimization problem.While genetic algorithms [7], [8] are the best known examples of evolutionary algorithms, other techniques, such as evolutionary programming [13], evolution strategies [15], genetic programming [14], ant colonies [9], differential evolution [12] and estimation of distribution algorithms [11] have also been developed and used in a large number of real-world applications.

B. General Scheme of Estimation of Distribution Algorithms
In estimation of distribution algorithms (EDAs) [11], [33], [34], [35], [36], [37], a population of candidate solutions is evolved by estimating the probability distribution underlying the individuals selected in each generation according to their fitness (objective function value) and represented as a probabilistic graphical model.Then, this probability distribution is simulated to obtain the new population of candidate individuals (see Figure 1).For a recent survey, see [38].Algorithm 1 is a general, unified pseudocode for all variants of EDAs introduced in Section II-D.The initial population of individuals is randomly drawn by the simulation of a probability distribution defined in the search space.If prior knowledge about the problem is available, it can be used to avoid an uninformative sample distribution, i.e., a uniform distribution.

Initial population of candidate solutions
In the first step, each of the M randomly obtained individuals is evaluated, and a fixed number of them, N , are selected according to a previously established criterion.This criterion can be deterministic (such as selecting the N individuals with the best evaluation functions) or stochastic (incorporating a random selection, where the selection probability for each individual is proportional to its evaluation function).
In the second step, the joint probability distribution of the selected individuals is estimated using different assumptions.In the literature on EDAs, three situations are considered: (a) the variables are independent; (b) only the bivariate dependencies are taken into account; and (c) the dependence degree between the variables is not restricted.In the vast majority of optimization scenarios that occur in the real world, assumption (a) is far from reality.On the other hand, assumption (c) can be computationally expensive in problems with a large number of variables.
In the third step, the probability distribution learned in the previous step is simulated to obtain a new population of individuals.In constrained optimization problems, it must be guaranteed that the simulated individuals verify the constraints in the simulation process.
These three steps (evaluation, estimation and simulation) are repeated until a previously determined stop condition is met.This condition can refer to the number of simulated generations, the convergence of the population of individuals towards the global optimum or even the maximum acceptable execution time (or number of iterations).In most EDAs that do not restrict the dependence relationships between variables, the joint probability distribution is estimated by a Bayesian network (Section II-C) learned from data.EDAs have also been developed with the probability distribution estimated from log-linear probability models [39], probabilistic principal component analysis [40], Kikuchi approximations [41], Markov networks [42], [43], Markov chains [44], copulas and vines [45], a reinforcement learningbased method [46], Gaussian adaptive resonance theory neural networks [47], growing neural gas networks [48], restricted Boltzmann machines [49], [50], [51] and in the deep learning area, from autoencoders [52], variational autoencoders [53], [54], and generative adversarial networks [55].Model selection in EDAs is a more complex problem.In [56], this problem was addressed based on variable transformations.The authors found a variable transformation technique that implicitly captures higher-order interactions and then uses lowdimensional models in the new transformed space (with easier parameter estimation).
Theoretical issues of EDAs (convergence analysis and runtime analysis) have been primarily addressed in algorithms that assume independence between variables in discrete [57], [58], [59] and continuous domain optimization approaches [60], and limited attempts have been made to study the behaviour of EDAs that do not restrict the dependence relationships between variables [61].See [16] for a survey on this topic.A quantification of the genetic drift effect in EDAs appears in [62].

C. Bayesian Networks
Bayesian networks for discrete variables.For combinatorial optimization problems, EDAs are usually based on Bayesian networks.A Bayesian network [63], [64], [65] is an interpretable compact representation of the joint probability distribution (JPD) p(X 1 , ..., X n ) over a set of variables X 1 , ..., X n .Conditional independence between triplets of variables is the central concept in Bayesian networks; it allows the JPD to be represented in a compact manner and with fewer parameters.Two random variables X and Y are conditionally independent given another random variable Z if p(x|y, z) = p(x|z), ∀x, y, z values of X, Y, Z, that is, whenever Z = z, the information Y = y does not influence the probability of x.
Suppose that we find a subset Pa(X Then the JPD can be factorized as with a (hopefully) substantially smaller number of parameters.This modularity permits easy maintenance and efficient reasoning.
The Bayesian network has two main parts.The qualitative part, by means of a directed acyclic graph (DAG), represents the conditional (in)dependencies between variables.The quantitative part contains the conditional probability tables (CPTs) of each discrete variable X i given any possible instantiation of its parent variables (variables from which arcs with destination X i result), Pa(X i ), in the DAG. Figure 2 shows a hypothetical example taken from [66] of a Bayesian network, modelling the risk of dementia.All variables are binary: x denotes 'presence' and ¬x denotes 'absence', for Dementia D, Neuronal Atrophy N , Stroke S and Paralysis P .For Age A, a means 'aged 65+'; otherwise the state is ¬a.Note that in the Bayesian network structure both Stroke and Neuronal Atrophy are influenced by Age (their parent in the DAG).These two conditions influence Dementia (their child).Paralysis is directly associated with having a stroke.The CPTs show the Bayesian network parameters and indicate the specific conditional probabilities attached to each node.For instance, if someone has neuronal atrophy and has had a stroke, there is a probability of 0.96 that the person will have dementia: p(d|n, s) = 0.96.However, in the absence of neuronal atrophy and stroke, this probability is only 0.10, i.e., p(d|¬n, ¬s) = 0.10.Thus, the JPD p(A, N, S, D, P ) requires 2 5 − 1 = 31 parameters to be fully specified.However, using this Bayesian network which provides the following factorization: p(A, N, S, D, P ) = p(A)p(N |A)p(S|A)p(D|N, S)p(P |S), only 11 input probabilities are needed.
Bayesian networks for continuous variables.For continuous optimization problems, EDAs are based on Gaussian Bayesian networks [67], [68].In a Gaussian Bayesian network, it is assumed that the associated JPD for X where µ = (µ 1 , . . ., µ n ) T is the vector of means, Σ is the n × n covariance matrix and |Σ| is its determinant.
The JPD in a Gaussian Bayesian network can be equivalently defined by the product of n univariate (linear) Gaussian conditional densities where µ i is the unconditional mean of X i (i.e., the ith component of µ), v i is the conditional variance of X i given values for x 1 , ..., x i−1 and β ij is the linear regression coefficient of X j in the regression of X i on X 1 , ..., X i−1 .
Learning of (Gaussian) Bayesian networks.The learning and simulation of Bayesian networks and Gaussian Bayesian networks, which correspond to steps 2 and 3 of Algorithm 1, respectively, are performed by similar methods regardless of whether we are working in discrete or continuous optimization scenarios.
Learning Bayesian networks [69], [70], [71] (and Bayesian Gaussian networks) from data can be achieved using two different approaches.On the one hand, constraint-based methods are used to statistically test conditional independencies among triplets of variables from data.A DAG that represents a large percentage (and whenever possible all) of identified conditional independence constraints is provided as the output of this type of algorithm.The most representative method is the PC algorithm [72].The PC algorithm, which starts with a complete (all nodes are connected) undirected graph, has two stages.In stage 1, the adjacencies in the graph (the skeleton of the learned structure) are output using edge elimination through hypothesis testing (such as the χ 2 test or the G 2 test).For any pair of adjacent nodes X i and X j in the graph and for a subset Z of the adjacent nodes of X i , edge X i -X j is removed if and only if Z renders X i and X j conditionally independent.This elimination process is carried out recursively, and the number of variables in the conditioning part, Z, of the hypothesis test increases in each step.In stage 2, the orientation of the edges and their transformation into arcs are the focus.Constraint-based methods are very general and easily adaptable for learning Gaussian Bayesian networks.However, very few EDAs have been developed based on these methods.Indeed, each time the number of variables in the conditioning part to carry out the hypothesis tests goes up, the cardinality of the dataset from which to learn the structure of the model increases considerably, greatly slowing down the evolutionary search process.
On the other hand, in score and search based methods, attempts are made to intelligently search the DAG spaces to maximize a given criterion (a number of heuristics have been applied for this purpose).A large number of criteria (Akaike information criterion [73], Bayesian information criterion (BIC) [74]...) are based on penalized likelihood.The penalty is defined by taking the complexity of the evaluated structure (its number of parameters) into account.This is necessary because otherwise the search would end with the complete model (all nodes linked with the rest of the nodes).Other criteria such as the K2 score [75], or the Bayesian Dirichlet equivalence (BDe) score [76] are associated with marginal likelihood, following a Bayesian perspective.Interestingly, the score should be decomposable, that is, it should be expressed as a sum (or product) of values that depend on only one node and its parent nodes.
Simulation of (Gaussian) Bayesian networks.The simulation of Bayesian networks (or Gaussian Bayesian networks) is carried out in step 3 of Algorithm 1.While most of the EDAs implemented in real applications have used the simulation method called probabilistic logic sampling [77], other methods such as likelihood weighting [78], or Gibbs sampling [79] can also be considered.In probabilistic logic sampling we use ancestral node ordering, i.e., we sample a node X i after sampling from all its parent nodes Pa(X i ) which results in a fixed value pa(x i ) (forward sampling scheme).Efficient sampling schemes to promote the visit of promising regions and avoid premature convergence have been recently proposed for Gaussian Bayesian networks [80].
In Figure 1 and Algorithm 1 the fitness value of each of the simulated individuals in the corresponding (Gaussian) Bayesian network is not considered as another node.This is how the vast majority of EDA algorithms work, i.e., without the evaluation or fitness variable appearing explicitly in the model learned in each generation.Exceptions to this general trend are found in [81] where the model learned in each generation is a Bayesian classifier (a special case of a Bayesian network) whose class variable is defined as the variable containing the evaluation of the individuals, and in [82], where in a many-objective optimization problem each objective to be considered is represented as a variable in the Gaussian Bayesian network.

D. Categorization of EDAs
Discrete EDAs are the name given to EDA-based optimization algorithms designed to solve combinatorial optimization problems.Instead, optimization in continuous domains is addressed by continuous EDAs.Discrete [36] and continuous [83] EDAs are categorized according to the probabilistic dependency relationships allowed for p l (x) in Algorithm 1.In [84], principles for a proper adaptation of discrete EDAs to the continuous case are presented.
Without dependencies.The joint n-dimensional probability distribution of the selected individuals in each generation is assumed to factorize as the product of n one-dimensional probability distributions, one per variable.Namely, p l (x) = n i=1 p l (x i ), where p l would be replaced by densities in continuous EDAs.The main representatives are: (a) the univariate marginal distribution algorithm (UMDA) [85] in discrete domains, the univariate marginal distribution algorithm for continuous domains (UMDA c ) [83], and UMDA G c , where a univariate Gaussian density is assumed for each variable; (b) population-based incremental learning (PBIL) [86], where the selected individuals at each generation are used in updating the components of the probability vector using a Hebbianinspired rule, and its continuous version [87], where Gaussianity in each marginal univariate density is assumed; and (c) the compact genetic algorithm (cGA) [88], where only two individuals from the current probability distribution are simulated in each generation, and the process of adapting probabilities towards the winning individual continues until convergence.A general formulation of univariate discrete EDAs that incorporates UMDA, PBIL and cGA is proposed in [89].
Bivariate dependencies.The three seminal works within this category are: (a) mutual information maximization for input clustering (MIMIC), (b) combining optimizers with mutual information trees (COMIT), and (c) the bivariate marginal distribution algorithm (BMDA).In MIMIC [90], searches for the best permutation between the variables are performed in each generation to find the probability distribution p π l (x) that is closest to the empirical distribution of the selected individuals when using the Kullback-Leibler divergence, where This means that the structure to be learned is a chain.In MIMIC G c [83], the MIMIC algorithm is adapted for continuous optimization problems by assuming Gaussianity for marginal and conditional densities.In COMIT [91], a tree structure Bayesian network is learned using the maximum weighted spanning tree algorithm at each generation.
In BMDA [92], the JPD is factorized from an acyclic graph formed by a set of trees, that are not necessarily mutually connected.Each tree is created taking into account the dependencies between pairs of variables that exceed a certain dependency threshold.
Multivariate dependencies.The vast majority of EDAs that belong to this category are based on learning the Bayesian network that best fits the distribution of the selected individuals in each generation and its subsequent simulation.The pioneering EDAs in this area are: (a) the estimation of Bayesian network algorithm (EBNA) [93], [94], where the use of both types of Bayesian network learning algorithms, constraint-based and score-and-search-based algorithms, is proposed.Among the scores used, the BIC and the K2 scores are the most notable.In each generation, the search procedure for EBNAs starts with the model induced in the previous generation.The estimation of Gaussian Bayesian network algorithm (EGNA) [83] is similar to EBNA although a Gaussian Bayesian network is learned in each generation.[95] adapted the EGNA approach for handling high-dimensional problems by controlling the complexity of the learned models; (b) the Bayesian optimization algorithm (BOA) [96], which uses the BDe metric to measure the goodness of each structure in combination with a greedy search algorithm that starts from scratch in each generation; (c) the learned factorized distribution algorithm (LFDA) [97], which controls the complexity of the learned Bayesian network through the BIC in conjunction with a restriction on the maximum number of parents that each variable can have; (d) the estimation of multivariate normal algorithm (EMNA) [83], which assumes a Gaussian JPD, whose vector of means and covariance matrix are estimated by the maximum likelihood method.In [98], an EDA based on the eigen analysis of the covariance matrix and its corresponding tuning strategy is proposed.In [99], an archive with a certain number of high-quality solutions from previous generations is preserved, and the evolution direction provided by the individuals in the archive is integrated into the estimation of the covariance matrix of the Gaussian model; (e) regularization-based EDAs benefit from likelihood regularization during the Bayesian network structure search in each generation.This allows an initial selection of candidate parents for each variable in the graph [100].In continuous domains, [101] proposes the use of regularized model learning of Gaussian Bayesian networks, pursuing sparseness in high-dimensional problems; (f) the iterated density evolutionary algorithm (IDEA) [102] (and its multiobjective version MIDEA [103]) use Gaussian kernel probability density functions, in contrast to mixtures of Gaussians to cope with multimodal optimization problems [104]; and finally, (g) the extended compact genetic algorithm (EcGA) [105] does not need to learn a Bayesian network in every generation to obtain an EDA with multivariate dependencies.In EcGA, the JPD is factorized as a product of probability distributions of varying size.Each group of variables is assumed to be independent from the others.Therefore, a factorization such as p l (x) = c∈C l p l (x c ), where C l denotes the set of groups of variables in the l-th generation, and p l (x c ) represents the marginal (discrete) distribution of variables X c , namely, the variables belonging to the c-th group in the l-th generation, is obtained.

E. Multiobjective EDAs
Multiobjective optimization problems involve optimizing more than one goal, i.e., there are m (m > 1) functions subject to a set of constraints.Optimal solutions are defined based on the Pareto dominance relation.The use of evolutionary computation to multiobjective optimization problems has led to the so-called multi-objective optimization evolutionary algorithms [106].Most of these algorithms, EDAs included, simplify the problem by reducing the m-dimensional space to a scalar value with fitness functions like the convergence indicator, the Pareto-optimal front coverage indicator, the hypervolume indicator and the unary additive -indicator.This is the strategy followed by EDAs based on neural networks [47], [48], [54], [51], on probabilistic models [82], [103], [107] or on a Parzen estimator [108].

III. EDAS IN PREPROCESSING
Optimal ordering of tables.The ordering of rows and columns in a table is a very sensitive issue that affects the readability of the table.Rearranging the rows and columns in a table when their orders are irrelevant reveals interesting patterns that make the table easier to read and interpret.Figure 3 shows an adaptation of the original table introduced by Bertin [110], where the columns represent townships and the rows are characteristics of the townships that are either present (labelled as 1) or absent (0).The cells labelled 0 are shaded red, and the cells labelled 1 are not shaded.The ordering of the rows and columns in Figure 3(a) to Figure 3(c) varies, showing that the examples with a reduced stress value result in more intuitive readability of the information.This stress measure f is an overall dissimilarity measure for the whole table and is computed from the distance of each table cell value to its neighbouring values.The literature shows that this problem has been a cause of concern in statistics for a long time [111].From a technical point of view, the optimal ordering of tables (minimum stress) is equivalent to solving two travelling salesman problems: one for the R rows and the other for the C columns, resulting in a search space with a cardinality given by R! • C!. Mathematically, the optimization problem is min π r (1),...,π r (R),π c (1),...,π c (C) f (π r , π c ), where the optimization variables π r (i) and π c (j) denote the position of row i and column j in a given π r ordering of rows and π c of columns, respectively.
In the EDA approach developed in [109], an individual was defined as x = (x 1 , ..., x R , x R+1 , ..., x R+C ), where x i = k means that the position in the ordering of the original ith row is k, and x R+j = l means that the position for the jth column is l.This double path individual representation is very intuitive, but it is redundant, because there may be different individuals representing the same solution with the same stress value, which can confuse the search process.An alternative representation was designed in the same paper using continuous EDAs, whose simulated real vectors of values were transformed into permutations as the respective order for the continuous individual.Univariate (UMDA and UMDA G c ), bivariate (MIMIC and MIMIC G c ) and multivariate (EBNA and EGNA) EDAs were used in the experiments.
Multivariate discretization.A supervised approach to multivariate discretization based on EDAs (UMDA G c ) was presented in [112].In contrast to many classical approaches, the discretization process is multivariate, that is, all predictor variables are discretized simultaneously, and each discretization is evaluated with a supervised classification method (i.e., it is a wrapper approach).This approach depends on the discretization sequence that contains the cut points for each of the bins (i.e., these cut points are the optimization variables).An individual in the EDA represents a discretization policy that transforms the original dataset into a discretized dataset.The discretized dataset is evaluated by the estimated accuracy (objective function to be maximized) provided by the classifier.k-nearest neighbours, classification trees and naive Bayes were evaluated for this purpose in the original paper.

IV. EDAS IN ASSOCIATION RULE MINING
Association rule mining is used to identify the rules discovered in datasets of transactions with a variety of items by using some measures of interestingness.There are two stages in the popular Apriori algorithm [113].In the first stage, frequent item sets are found, and in the second stage, association rules based on the frequent item sets found are generated.The first stage is the most computationally demanding and is where EDAs (specifically PBIL) have been used [114].The EDA adopts a binary code (1 if the item is selected; 0 otherwise) for optimization variables, where each individual length is equal to the total number of items.The fitness function finds frequent item sets as those with support level greater than the minimum support level and discards the rest.Its value is used to update the probability vector in PBIL.
If the aim is to predict a special variable, the task is called class association rule mining.In [115], an EDA replaces the genetic operators (crossover and mutation) of conventional genetic network programming (a graph-based evolutionary algorithm) to improve the efficiency for generating valid offspring and deal with dynamic environments.EDAs are of UMDA-and PBIL-type.

V. EDAS IN SUPERVISED LEARNING
We are given a labelled data set of n variables forming vector X = (X 1 , ..., X n ), including features from N observations.Let D = {(x 1 , c 1 ), ..., (x N , c N )} denote the data set, where x i = (x i 1 , ..., x i n ), i = 1, ..., N , while c i indicates its label from a class variable C. For regression, C will be denoted Y , the real-valued response variable.
Table I shows a summary of the papers reviewed in this section.Pittsburgh approach [122] UMDA, COMIT, EBNA Pittsburgh approach [123], [124] BOA Fuzzy rule-based systems [125] UMDA, MIMIC, UMDA G c Support vector machines Hyperparameters [127] UMDA G c , BUMDA Artificial neural networks Weights of a multilayer perceptron [129] PBIL Weights of a multilayer perceptron [130] PBIL Weights of a multilayer perceptron [131] UMDA Weights of a multilayer perceptron [132] UMDA G c Weights and hidden biases of a feedforward neural network [133] PBIL Weights and structure of a multilayer perceptron [134] PBIL Weights and structure of a multilayer perceptron [135] UMDA G c Pruning the structure of a multilayer perceptron [136] cGA, EcGA, BOA Design of a fully connected multilayer perceptron [137] UMDA Hyperparameters of a convolutional network [ Symbolic regression [157] UMDA Symbolic regression [158] Denoising autoencoder genetic programming Support vector regression [159] UMDA G c Feature subset selection Selective naive Bayes [161] EBNA Selective naive Bayes [162] EBNA Selective naive Bayes [163] cGA, EcGA, BOA Support vector machines [164] TEDA Logistic regression [165] UMDA G c Classification trees [166] UMDA Naive Bayes [167] FEDA k-nearest neighbours.To classify a query instance x = (x 1 , ..., x n ), the k-nearest neighbours method [116] predicts the unknown class label from the classes associated with the k instances of the training set that are closer to x in the instance space using a simple majority decision rule.The accuracy of this classifier depends heavily on the weight of each variable to compute distances between instances; i.e., the problem is max w1,...,wn Acc(φ), where φ is the knearest neighbours classifier, Acc is the objective function, and the neighbours are determined with a weighted distance: , where w j is the weight assigned to variable X j , and δ(x j , x i j ) measures the distance between the j-th components of x and x i .In [117], a search for these weights the optimization variables was performed with an EDA based on the EBNA approach in a discrete domain (with three different weight values) and based on the EGNA approach in a continuous domain.
Classification trees.Classification trees [118] are expressed as a recursive partition of the instance space.The tree has three kinds of nodes.First, a root node with no incoming edges and several outgoing edges.Second, internal nodes or test nodes, with one incoming edge and several outgoing edges.Third, leaf nodes with one incoming edge and no outgoing edges.In standard algorithms for inducing classification trees, a tree is built in a greedy manner, and a search for the optimal inner leaf is performed at each step.The use of EDAs to this problem allows a more global approach to find the optimal classification tree; i.e., the tree φ * which solves max φ∈CT (n) Acc(φ), where CT (n) is the set of all classification trees built with the n predictor variables in D.
To the best of our knowledge [119] is the only work in which EDAs use the previous fitness function.The authors defined a novel EDA named Ardennes.The individuals are binary trees whose depth is upper bounded by h, a predefined user parameter.The predictor variables (the optimization variables) are selected to fill root and internal nodes according to independent probability distributions associated with these nodes.In the first generation, all the initial probability distributions are uniform.However, in the first generation, the probability of selecting the class attribute C is zero for all variables between the root and depth h − 1.At depth h, the probability of selecting the class is 1.Note that the probability of selecting the class at intermediate levels may increase during evolution since a homogeneous node (all objects belonging to the same class label) is automatically transformed into class nodes.The sampling of nodes is performed on-demand following a depthfirst search: the root node is sampled first, then its left child, the left child of that child and so on.If the class attribute is sampled, the current node is turned into a leaf, and its label is set according to the most frequent class found in the subset of instances reaching that node.Otherwise, if a predictive variable is sampled, we perform a binary split on the instances reaching that node.
Rule induction.The induction of "IF-THEN" rule based classifiers with evolutionary computation has been achieved using two different approaches.In the Michigan approach to classifier systems [120], an individual of the population with a fixed length string is considered, whereas the classifier system itself is represented by the whole set of individuals in the population.In contrast, in the Pittsburgh approach [121], the use of a variable length string is proposed, and each individual in the population is interpreted as a classifier system.The optimization problem is max φ∈PCS Acc(φ), where PCS is the set of all classifier systems built following a Pittsburgh approach.
In [122], an EDA for rule induction that can be seen as an instantiation of the Pittsburgh approach was proposed.The individual representation of the IF part of the rule (the antecedent) consists of the disjunction of simple antecedents (the optimization variables), each with a dimension given by n, allowing each variable to take values that are "equal to", "different from", and "any possible value".Univariate, bivariate and multivariate EDAs (UMDA, COMIT, EBNA) were used in the experiments.A 2-phase adaptation for continuous variables in [123] first uses BOA to model and generate single rules that are then assembled together to form the rule sets.In the second phase, rule sets (classifiers based on rules) are recombined with a procedure similar to a classic genetic algorithm.This improves the effectiveness and efficiency of the rule structure exploration.The same authors add an embedded feature reduction approach to remove irrelevant variables gradually following the evolution of rule population [124].Features are first discarded according to their low frequency in the rule population.Then, more features are removed if they do not belong to the Markov blanket of any other variable, this being found in the Bayesian network learned in the BOA phase.The evolutionary search turns to be more effective.
In fuzzy rule-based systems, rule weights are used to improve their predictive capability.In [125], EDAs (UMDA, MIMIC and UMDA G c ) simultaneously evolve the rules and their weights (both are the optimization variables) and incorporate domain knowledge.
Support vector machines.Support vector machines [126] apply a simple linear method to data, albeit in a highdimensional feature space that is non-linearly related to the original input space.Data, represented as points in space, are mapped such as to leave a gap or margin as wide as possible between separate categories.New instances are then mapped into the new space and predicted to belong to a category depending on which side of the gap they fall on.Mathematically, we aim at finding a hyperplane H : w T x+b = 0 that separates the positive from the negative instances, where vector w is normal (perpendicular) to H.The best hyperplane is that maximizing the margin around H. If perfect separation is relaxed to allow for misclassified points, non-negative slack variables ξ i are introduced and these points are penalized.Thus, the optimization problem results in min w,b,ξ where M controls the trade-off between the size of the margin and the slack variable penalty or errors.This (primal) problem is solved by allocating a Lagrange multiplier λ i ≥ 0 to each constraint and solving the dual problem.Mapping data -via a mathematical projection known as a "kernel trick"-to a much higher dimensional space where there is a linear decision rule is the most used alternative approach.The kernelized dual problem is is the kernel function.The expressions for w and b are derived from the λ i solutions.
In [127], the kernel is fixed as a radial basis function (RBF), whose defining hyperparameters and the penalty factor M are the optimization variables that the EDA uses.Two EDAs are used: UMDA G c and BUMDA, based on a non-Gaussian approximation to each univariate Boltzmann density function.The objective function is the classification accuracy.
Artificial neural networks.Artificial neural networks are computational models for information processing that attempt to mimic the learning of biological neural networks [128].They are inspired by an animal's central nervous system (in particular, the brain) and are used to estimate or approximate functions that can depend on a large number of inputs.Some layers of interconnected nodes, each building upon the previous layer, try to refine and optimize the prediction (forward propagation).The input values for each training instance are weighted and summed at each hidden layer neuron and the transfer function converts the weighted sum into the input of the output node layer.This is repeated through the network H times, where H is the number of hidden layers.Both the network architecture and the weights of each connection between neurons are found in view of maximizing the classification accuracy.Mathematically, the problem is max H,n1,...,n H ,w 0 ,w 1 ,...,w H ,f1,...,f H Acc(φ), where φ is the neural network, n i and f i are the number of nodes and the transfer function, respectively, in the ith hidden layer, and w i is the vector of weights connecting nodes between the (i − 1)th and ith hidden layers.
Early work on EDAs in artificial neural networks focused on evolving the weights of multilayer perceptrons with a fixed topology [129], [130], [131], [132], and the use of this evolutionary computation method was considered as an alternative to the backpropagation method given its local optimality characteristic, stemming as a gradient descent method.In all these works, the multilayer perceptron topology only allowed one hidden variable layer.The individuals of the EDAs were real vectors of dimensions equal to the number of weights of the multilayer perceptron.The EDAs used for the search of the optimal weights were PBIL [129], [130], UMDA G c [131], [132] and MIMIC G c [131].The input weights and hidden biases are the variables to be optimized in [133] in single layer feedforward neural networks, coupling an EDA (PBIL) and the extreme learning machine model.
The simultaneous search for the optimal structure and weights of a multilayer perceptron is a more complex problem, that has also been addressed with EDAs.While PBIL was used in combination with a quasi-Newton method in [134] to optimize the discrete architecture and its corresponding real value weights at the same time, in [ A simpler problem is that of the optimal pruning of synaptic connections from a complete artificial neural network topology, such as that developed in [136] for the case of a multilayer perceptron with a single hidden layer.In this case, the representation of individuals is direct and consists of assigning a bit for each connection between nodes in the multilayer perceptron.The bit denotes whether the corresponding connection is to be used.Three EDAs were used: cGA, EcGA and BOA.
The automatic design of a fully connected multilayer perceptron with only a hidden layer was approached with a UMDA in [137] for time series forecasting.The parameters to be optimized were the number of input nodes, the number of hidden neurons, how the nodes are connected (weights), and the seed used to initialize the connection weights in the backpropagation method.
Deep neural networks [138] contain multiple hidden layers of units between the input and output layers rendering the search for their optimal hyperparameters much more complex than in a multilayer perceptron with a single hidden layer of units.Evolutionary computation algorithms applied to optimize deep learning is called evolutionary deep learning.According to [22], evolutionary deep learning has been used in all deep learning models, roughly divided into convolutional neural networks, deep belief networks, stacked autoencoders, recurrent neural networks and generative adversarial networks.The formulation above would contain additional parameters in the case of deep neural networks, e.g., the number of kernels in the convolutional layer, the kernel size and the kind of pooling layer.In [139], univariate EDAs (with continuous and discrete variables treated separately) were used in convolutional networks for the simultaneous optimal configuration of the number of kernels in the convolutional layer, the kernel size of the convolutional layer, the number of neurons in the fully connected layers, the kind of activation function, and the kind of pooling layer.In [140], a hybrid method based on training gradient-based methods together with EMNA and UMDA G c for weight optimization in convolutional neural networks was proposed.A UMDA G c was applied in [141] to obtain the optimal weights in stacked autoencoders, i.e., several layers of autoencoders, where the output of each hidden layer is connected to the input of the successive hidden layer.
Logistic regression.Logistic regression [142] is a flexible probabilistic supervised classification method that allows the coexistence of discrete and continuous predictor variables, and no assumptions are made about their density functions.The model is formulated as p , where β 0 , ..., β n are the coefficients and C denotes the binary (0/1) class variable, although extensions to the multiclass case do exist.The coefficients are estimated by maximizing the conditional log-likelihood function given by ln L(β|x 1 , ..., The equation system, with n + 1 equations and n + 1 unknowns to be solved to estimate β = (β 0 , β 1 , ..., β n ) that maximize ln L(β|x 1 , ..., x N ), does not have an analytical solution.Itera-tive techniques such as the Newton-Raphson method are used to provide solutions that often become stuck in local maxima.
In [143], EDAs were used to approximate the Pareto front for two objectives (a biobjective fitness function), calibration, as an alternative to the Newton-Raphson method for the maximization of the conditional log-likelihood, and discrimination, to maximize the area under the ROC curve.For each of these objectives, a UMDA G c was designed.The best individuals obtained with each of these UMDA G c s were evaluated in the other objective, i.e., the objective that had not served to guide the search; thus, an approximation to the Pareto front was obtained for the biobjective problem.Each individual in the EDA contained the n + 1 optimization variables, the real numbers representing the β coefficients.
Regularized logistic regression is useful for problems with few samples and a large number of variables.Here, the regularization term needs to be determined.This involves searching for the optimal penalty parameter that represents a tradeoff between likelihood and coefficient shrinkage.In [144], a new regularized logistic regression method based on the evolution of regression coefficients using EDAs was presented.The key issue was the modification of the simulation step in a UMDA G c , whose individuals, as above, represent the β i coefficients, to guarantee their shrinkage via truncated Gaussians as an intrinsic regularization approach.
Bayesian classifiers.In 1961, Minsky [145] showed that the simplest Bayesian classifier, the naive Bayes classifier, in which the predictor variables are conditionally independent given the class variable, has decision boundaries that are hyperplanes.As a result of this very negative theoretical result for naive Bayes, research on Bayesian classifiers did not resume until 1997 when Friedman et al. [146] introduced tree-augmented naive Bayes classifiers.Bayesian classifiers are used to compute the posterior probability of the class variable p(c|x 1 , ..., x n ) from the joint probability distribution p(c, x 1 , ..., x n ), factorized according to the graph structure, i.e., p(x, c) = p(c|pa(c)) n i=1 p(x i |pa(x i )), where pa(c) and pa(x i ) denote values of the parents of C and X i , respectively.In the survey by Bielza and Larrañaga [147], Bayesian classifiers were grouped based on the different factorizations of the joint probability distribution (Figure 4).tional probabilities) that maximize the classifier performance, i.e., max Pa(C),Pa(X1),...,Pa(Xn),θc,θ1,...,θn Acc(φ), where φ is the classifier.
In [148], the interval estimation naive Bayes algorithm was introduced.This algorithm starts with an estimation of the conditional probabilities of each predictor variable given the class variable, using confidence intervals.Subsequently, a UMDA G c is used to search within each confidence interval for the optimal point estimate (the optimization variables) in terms of maximizing the accuracy of the naive Bayes classifier.
In the seminaive Bayes model [149], the naive Bayes conditional independence assumption is relaxed by introducing new variables obtained as the Cartesian product (supernode) of two or more original predictor variables.Thus, the model can be used to represent dependencies between the original predictor variables.The new predictor variables are still conditionally independent given the class variable.The seminaive Bayes modelling algorithm proposed in [149] is based on a greedy search.To prevent the search from getting stuck in local maxima, in [150], the search was carried out with a UMDA.The individuals of this UMDA have n bits (the optimization variables), each one with an integer value in {0, 1, 2, ..., n}, representing the supernode that each variable belongs to (0 means that the variable is not part of the model).
Metaclassifiers.In metaclassifiers, which are also referred to as multiple classifier systems, a set (or ensemble) of classifiers are combined to solve the same supervised classification problem.The stacked generalization, bagging, boosting, cascading and random forest methods are among the most developed metaclassifiers.There are several optimization problems to be solved to obtain the optimal design of the abovementioned metaclassifiers, all with the aim of maximizing the classification accuracy.
Stacked generalization [151] is a generic methodology where the outputs of a set of base classifiers are combined through another classifier.An interesting (combinatorial) optimization problem that arises is the selection of classifiers to be combined from the set of base classifiers.This is called classifier subset selection and is approached with EDAs (EBNA) in [152].An individual in the EDA algorithm has a binary encoding, where each position (optimization variable) refers to a concrete base classifier from the set of candidates, and its values are 1 if the base classifier is used and 0 otherwise.
Boosting [153] builds the ensemble of classifiers incrementally, adding one classifier at a time.In (the standard algorithm) AdaBoost, the original dataset is first sampled from a uniform distribution over the instances, whereas the classifier added at step t is selectively trained on a dataset sampled from a distribution that is adapted at each step.The instances where the preceding classifiers fail increase their likelihood of being in the next sample.AdaBoost uses the same base classifier in all steps.In [154], a UMDA G c was initialized with means equal to the weights of each classifier built by the AdaBoost algorithm.Then, the EDA tries to improve the whole ensemble prediction by evolving the voting weights of each classifier (the optimization variables) for each class label.
Regression.A standard regression task provides a pre-specified model structure (mathematical function relating the response variable Y and the predictors X 1 , ..., X n ) and finds the parameters that best fit a given dataset.Fitting means a minimum (squared) distance of data points from the function (least squares).Parameters are usually estimated with closed formulas (linear regression) or iterative methods (nonlinear regression, as empirical growth curves, exponential models, and Coob-Douglas models [155]).Although the latter require choosing starting guesses, incremental change, and step size, this is usually tackled by trial and error strategies.EDAs seem not to have any contribution in this field of standard regression.
A second appoarch is symbolic regression that simultaneously searches for a model and its parameters.Model search entails moving in the space of mathematical expressions (mathematical operators, analytic functions, constants, state variables) aiming at best fitting a given dataset, both in terms of accuracy and simplicity.Since symbolic regression is an NP-hard problem, evolutionary algorithms are appropriate tools.Two new operators, called α and β, are proposed to represent a mathematical model such that the resulting equations are simplified.EDAs (UMDA G c ) are used to select the appropriate operators and parameters (the optimization variables) [156].The mean square error is used as a fitness function to be minimized.Due to the few parameters used by EDAs compared with other metaheuristics as differential evolution, genetic algorithms, or particle swarm optimization, in [157] a discrete UMDA is proposed to solve the symbolic regression problem.Denoising autoencoder genetic programming is an EDA for genetic programming, where the probabilistic model are denoising autoencoder long short-term memory networks.In [158] this algorithm is used for symbolic regression.
A third approach to regression is based on supervised machine learning techniques.In [159] the parameters of a support vector regression, an extension of support vector machines for regression, are optimized with an UMDA G c .The parameters are the weight vector w and the threshold value b (optimization variables).
Feature subset selection.Feature subset selection [160] is the process of identifying and removing as many irrelevant and redundant variables as possible.This reduces the dimensionality of the data and may help learning algorithms operate faster and more effectively.The resulting model is a more compact and easily interpreted representation of the target concept, and in some cases, the accuracy of future classification may the improved.Two main approaches for feature subset selection have been developed [1]: filter and wrapper.In filter feature subset selection methods, the relevance of a feature, or a subset of features, is assessed from only the intrinsic properties of the data.In univariate filtering, a feature relevance score is calculated according to the class, and low-scoring features are removed.In multivariate filter methods, the subset of features is chosen according to its relevance (with respect to the class) and interfeature redundancy.Then, the subset of selected features obtained with the univariate or multivariate filter method is used as input variables for the classification algorithm.In contrast, wrapper methods evaluate each possible subset of features with a criterion consisting of the estimated performance of the classifier built with this subset of features.Thus, unlike wrapper methods, filter methods screen variables without taking into account the subsequent classifier to be used (see Figure 5).The cardinality space for the multivariate filter and wrapper approaches is 2 n , where n denotes the number of features.The univariate filter is simply a feature ranking.Evolutionary algorithms have also been used to address the feature subset selection problem, mainly because the standard representation of individuals for this problem only requires a binary vector x = (x 1 , ..., x n ), where x i = 1 if variable X i is selected and 0 otherwise.All the papers we have reviewed on the use of EDAs for feature subset selection refer to wrapper approaches, where the optimization problem is formulated as max S⊆{X1,...,Xn} Acc(φ S ), where φ S is the classifier built with the variables in S.
In the seminal paper [161], EBNA was used in combination with a classification tree induction method and a naive Bayes classifier.The naive Bayes classifier was also used in [162] for an empirical comparison among EBNA, two types of genetic algorithms and two greedy algorithms as sequential feature selection and sequential feature elimination; in [163], three EDAs, namely, cGA, EcGA, and BOA, were experimentally compared.In [164], a hybrid method consisting of a genetic algorithm and a UMDA, named TEDA, was applied to problems with tens of thousands of predictor variables.In [165], the EDA process (UMDA G c ) was embedded in an adapted recursive feature elimination procedure of a logistic regression model.Recently a biobjective EDA (a variant of UMDA) was proposed for the feature subset selection problem in intrusion detection, and the accuracy of the classifier (a classification tree in this case) and the number of selected features were taken into account [166].Fast feature selection is a concern in the fast EDA (FEDA) of [167].FEDA does not evaluate all new individuals by the actual fitness function, but with an approximate model based on a special Bayesian network to assign fitness values.A strategy allows to filter subsets of variables that are informative with high fitness values or in an unexplored region.The wrapper approach is then applied (in this case a naive Bayes model) on this reduced set.

VI. EDAS IN CLUSTERING
Table II provides a summary of the papers reviewed in this section.
In cluster analysis, which is also referred to as unsupervised classification, the aim is to group a collection of N objects MIMIC, COMIT, EBNA K-medoids [173] UMDA Affinity propagation [175] UMDA, EBNA Graph-based [176] UMDA Density-based [177] PBIL Probabilistic Bayesian network-based [180] UMDA into subsets, or clusters, so that the objects within each cluster are more closely related to one another than objects assigned to different clusters.Three main types of cluster analysis methods have been developed: hierarchical clustering, partitional clustering and probabilistic clustering.In the three clustering methods, the object grouping process can be seen as an optimization problem, hence the use of metaheuristics such as EDAs can help in the search for optimal clustering.Hierarchical clustering.Hierarchical clustering algorithms [168] organize the objects in a hierarchical structure depicted by a binary tree or dendrogram (see Figure 6).Agglomerative hierarchical clustering algorithms start with N clusters, each of which includes exactly one object.A series of merging operations, designed to group all objects within the same cluster is then performed.Merging operations are applied to pairs of subsets of objects.Iterative searches are performed at each step to obtain the best grouping according to a certain dissimilarity criterion (single linkage, complete linkage, average linkage, centroid linkage, Ward's method, etc.); thus, the number of clusters are reduced by one unit until all objects belong to the same cluster at the end of this process.This dendrogram construction process is only locally optimal and does not guarantee that the resulting binary tree is globally optimal, since the actual formulation should be min D HC d(D HC , D dataset ), where d is an appropriate distance measure, D HC is a dissimilarity matrix of dimension N × N between pairs of objects obtained from the hierarchical clustering output, and D dataset is the dissimilarity matrix obtained from the dataset.dissimilarity x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8 In [169], UMDA was applied to provide stochasticity to the subset merging process, promoting the most numerous subsets to have a higher probability of being joined as long as the value of the centroid linkage did not exceed a certain threshold.This restriction means that the constructed dendrogram does not necessarily have to be complete.Partitional clustering.Partitional clustering methods, such as the well-known K-means algorithm [170] (where K is the number of clusters), usually optimize a given cohesion criterion, for example, the sum of the distances of the objects to the centroids of the cluster to which they belong, via an iterative optimization procedure.The general problem is formulated as: min Cl1,...,Cl K K k=1 , where Cl k refers to the k-th cluster, and c k = (c k1 , ..., c kn ) is its corresponding centroid.
EDAs were used in Forgy's version with the object membership representation of individuals [171].In this representation, the EDA individuals are strings of length N , where each position can take integer values from 1 to the number of clusters K.The value in the i-th position (optimization variable) represents the cluster to which the object that occupies that position belongs.Different EDAs, such as MIMIC, COMIT and EBNA were compared empirically.While the centroid is representative of each cluster in the K-means algorithm, in the K-medoids algorithm [172], the representatives of each cluster must be objects that are present in the dataset (an issue that does not occur in K-means).This facilitates the representation of the individuals of an evolutionary algorithm that attempts to approach the problem of the search for the optimal medoids.In this way, in [173], a UMDA approach with an encoding of individuals was proposed based on binary strings (the optimization variables), in which 1 (0) indicated that the corresponding object in the input dataset was (not) considered as a medoid.The K value or number of clusters was also found.
The affinity propagation algorithm [174] is based on the concept of message passing between objects.Its aim is to find a subset of representative objects called exemplars.As in the K-medoids algorithm, exemplars are members of the input dataset.Unlike K-medoids, the affinity propagation algorithm simultaneously considers all objects as potential exemplars, avoiding the selection of initial exemplars.In [175], UMDA and EBNA were used to find the optimal preference of each object.Preferences are a measure of how likely each object is to be chosen as an exemplar, which is considered a highly influential parameter on the result of the affinity propagation algorithm.The optimization variables are the preference values, with three possible standard assignments.
In [176], an attempt was made to address the clustering problem with EDAs (UMDA) by searching for the optimal graph (a minimum weighted spanning tree based on the distance between objects), where the number of nodes equals the number of objects to cluster, and each edge in the graph links two objects that may belong to the same cluster.An individual is encoded with N −1 variables corresponding to the tree edges.Each (optimization) variable of the EDA follows a Bernoulli distribution with a value of 1 if the two objects at the edge must link within the same cluster and 0 otherwise.The parameter of this distribution is inversely proportional to the distance between those two objects.
The use of EDAs to the automatic generation of densitybased clustering algorithms was proposed in [177].PBIL combined eight density-based cluster algorithms as if they were building blocks, thus creating new clustering algorithms.
Probabilistic clustering.Probabilistic clustering is a type of model-based clustering based on fitting the density of all the sample data with finite mixture models.Fitting these finite mixture models requires the estimation of some parameters characterizing the component densities and some mixing proportions.Estimations are based on maximizing the log-likelihood of the data.However, these estimations have nonclosed solutions, and ad hoc procedures, such as the expectation-maximization (EM) algorithm [178], are widely used.In the E-step, the missing data (in probabilistic clustering, this refers to the cluster membership of each object) are estimated given the observed data and the current estimate of the model parameters.The estimates of the missing data from the E-step are used to output a version of the complete data.In the M-step, the complete-data log-likelihood function is maximized under the assumption that the missing data are known.
Bayesian networks provide an intuitive and natural way of performing model-based clustering.It is sufficient to introduce a hidden node representing the cluster variable, H, to yield models with a latent structure where the data are systematically missing.These Bayesian networks can express the probability distribution of the observed data X as a parametric finite mixture model (Figure 7).This has two main advantages: (a) the factorization of the distribution according to the structure of the probabilistic graphical model, and (b) the use of efficient methods for exact inference to provide a probability distribution over the values of the cluster variable H. Probabilistic clustering via Bayesian networks aims to find the structure and parameters of a DAG with similar structure to augmented naive Bayes models of Figure 4, but replacing the class variable C with the cluster variable H.These structures are denoted as G clus BN .Thus, the formulation is max G clus BN ln L(D|G clus BN ), where L(•) is the likelihood of the dataset D given the Bayesian network-based clustering model.In [179], the use of the EM algorithm was proposed to exploit the message passing procedure for inference to efficiently perform the E-step in Bayesian networks.

H X1 X2 X3
Fig. 7. Structure of a Bayesian network with a hidden variable represented by H.The probability distribution of the observed data can be written as a parametric finite mixture model with K components: p(x 1 , x 2 , x 3 ; θ) = K h=1 p(h; θ)p(x 1 |h; θ)p(x 2 |h; θ)p(x 3 |x 2 , h; θ), where θ denotes the parameter vector and K is the number of clusters.
In [180], the use of EDAs based on UMDA was proposed

VII. EDAS IN REINFORCEMENT LEARNING
The aim of reinforcement learning [181] is to develop intelligent agents capable of adopting actions that maximize the expected reward (fitness function) for each state of the system.Many algorithms that search for the policy of actions (optimization variables) to take for each possible state of the system have been developed.In some of these algorithms, the conditional probability distribution of the actions given the agents's state is explicitly determined.In [182], a conditional random field [183] was learned from the selected episodes (individuals providing the best rewards) at each generation of an EDA based on conditional random fields.Conditional random fields are models that naturally adapt to the sequential decision process inherent in reinforcement learning, as they are discriminative models for the classification of sequential data.
Deep reinforcement learning uses deep neural networks to learn policies in high-dimensional input spaces.Deep Qnetwork [184], a type of deep reinforcement learning, is combined with EDAs (PBIL) to strengthen the exploitation and exploration capabilities in a job shop scheduling problem [185].An individual in the EDA has three parts: the scheduling sequence of all operations, the machine assignment of all operations and the processing speed level of each operation.Both maximum completion time and total electricity price are maximized simultaneously (biobjective problem).

VIII. EDAS IN ASSOCIATION DISCOVERY WITH BAYESIAN NETWORKS
The machine learning task of discovering associations between a set of random variables has its major representative in Bayesian network models.EDAs have been used within the field of Bayesian networks for different problems that arise in exact methods of inference or, alternatively, for parameter and structure learning algorithms based on the score and search approach.Table III provides a summary of the papers reviewed in this section.

A. Inference
In [186], one of the most popular algorithms for exact inference, a task that is NP-hard [187], was proposed for multiply connected Bayesian networks.The first step in this algorithm is to moralize the Bayesian network structure, i.e., all variables with a common child are linked and then all edge directions are removed.The resulting graph is called a moral graph.The second step of the algorithm (considered the toughest step in terms of computational complexity) is the socalled triangulation of the moral graph.A graph is triangulated if any cycle of length greater than three has a chord.The resulting structure is then used for evidence propagation and probability computation.The basic technique for triangulating a moral graph (Figure 8) is through the successive elimination of graph nodes.Before eliminating a node and its edges, we check that all its adjacent nodes are directly connected to each other by adding the required edges to the graph (complete subgraph).The nodes are chosen for elimination according to a given order of the variables.The quality of the triangulation is measured by the weight of the triangulated graph, w(S t ) = log 2 ( Cli Xi∈Cli r i ), where Cl i denotes a clique of the triangulated graph S t , composed of vertices X i , each with r i different states.A clique is a subgraph that is complete (all nodes are pairwise linked) and maximal (it is not a subset of another complete set).This weight (objective function) is determined by the order in which the nodes are eliminated and becomes worse if more edges are added.Hence, the search for an optimal triangulation is equivalent to the search for an optimal node elimination sequence (individuals), i.e., the search for an optimal permutation of nodes.In [189], an approach based on recursive EDAs (REDAs) was proposed for both discrete and continuous representation of the variables.REDAs partition the set of vertices (that are to be ordered) into two subsets.In each REDA call, the vertices in the first subset are fixed, whereas the other subset of variables is evolved with a standard EDA.In the second call, the subsets switch roles.The aim of partial abductive inference, also known as the maximum a posteriori (MAP) problem, is to find the most likely configuration for a subset X E of variables in the Bayesian network, which is known as the explanation set.Given a set of observed variables X O = x O , and denoting X U = X \ X O as the set of unobserved variables, we have X E ⊂ X U .In a MAP problem, the aim is to obtain the configuration x E * for X E such that x E * = arg max xE p(x E |x O ).In [190], discrete EDAs with different degrees of model complexity (UMDA, MIMIC and EBNA) were proposed to solve the MAP problem.The individuals in the EDA population represent a possible configuration only for the variables in the explanation set.

B. Learning
For parameter and structure learning algorithms based on the score and search approach, the goal is to max G BN Score(D, G BN ), where Score refers to any criterion based on penalized likelihood or a Bayesian score, see Section II-C.
In the space of DAGs, [191] used two univariate EDAs (UMDA and PBIL) in combination with three different scoring metrics (BIC, K2 and entropy).The individuals represent a connectivity matrix.To prevent the new individuals obtained by simulating the Bayesian network from being cyclic graphs, a repair operator was introduced.Once a cycle is detected in the individual, the repair operator randomly deletes one arc of the cycle.This random deletion is repeated until a DAG is obtained.[192] used a representation of individuals similar to [191] with UMDA and PBIL as algorithms and BIC as the score to be maximized.[193] incorporated a mutation based on matrix transposition in the EDA algorithm flow.This matrix transposition increases the possibility of inferring the correct arc direction by considering the direction of edges in candidate solutions as bidirectional.Four standard EDAs, UMDA, PBIL, MIMIC and BOA, were used together with BDe and BIC as scores.[194] introduced a new mutation operator named the probability mutation.This operator changes the probability distribution learned by PBIL and takes into account that the acyclicity restriction of the graph is fulfilled at the same time.
In the space of possible orderings on the nodes, [195] applied two types of discrete-encoded (UMDA and MIMIC) and continuous-encoded (UMDA G c and MIMIC G c ) EDAs to obtain the best ordering for the K2 algorithm.The K2 algorithm needs a fixed total order between the nodes, as its result is dependent on that order.Therefore, EDAs try to obtain the best among the n! possible orders.For discrete encoding, they used a bijective mapping to represent possible orderings of n variables with n−1 random variables.The simulation step was adapted to output a valid permutation of the variables.This adaptation is also necessary for continuous encoding, where each n-dimensional real vector can be transformed into a valid permutation of the n variables.
Hidden Markov models (HMMs) [196] are dynamic Bayesian networks used to model Markov processes that cannot be directly observed but can be indirectly estimated by state-dependent outputs; in other words, the state is not directly visible but the state-dependent output is.The parameters of these models (transition probabilities between hidden states, emission probabilities of the observations given the hidden state, and initial probability distribution of the hidden states) are commonly learned using algorithms derived from gradientbased methods, such as the Baum-Welch procedure.[197] applied PBIL as an alternative to the Baum-Welch procedure, considering an individual as the concatenation of the three types of parameters.

IX. REAL-WORLD APPLICATIONS OF EDAS IN MACHINE LEARNING
This section shows a number of EDAs selected applications for inducing machine learning models.First, EDAs for association rule mining in traffic flow prediction and in blood index analysis are found in [115] and [114], respectively.Second, in supervised learning, EDAs contribute to finding: the support vector regression parameters applied to software reliability prediction [159]; the parameters of a full connected multilayer perceptron with a hidden layer in time series forecasting problems [137]; the weights in convolutional neural networks for text categorization [140]; the input weights and hidden biases in single layer feedforward neural networks for drought prediction in China [133]; the regularized parameters of logistic regression models in microarray data classification [165]; and in different feature selection problems: of clinical findings in cirrhotic patients treated with transjugular intrahepatic portosystemic shunt [198], of peakbins in mass spectrometry data [199], of nucleotides for splice site prediction in Arabidopsis thaliana [200], of traffic and connection features in intrusion detection [166], and to predict the likely method (negotiation, won at trial, etc.) of settlement for a claim in a legal business [201].Third, EDAs help in gene expression data clustering and biclustering [202] and probabilistic clustering [180].Finally, in [182], EDAs were used in the reinforcement learning problems of perceptual aliasing maze [203], and of flexible job shop scheduling [185].

X. CONCLUSIONS AND FURTHER TOPICS A. Conclusions
We have reviewed work in the literature on the use of discrete and continuous estimation of distribution algorithms to different machine learning problems, ranging from data preprocessing to association rule mining, variable selection, supervised learning (classification and regression), clustering, reinforcement learning and association discovery with Bayesian networks.According to Tables I-III, univariate EDAs are used three times more than complex probabilistic models (bivariate or multivariate models).Some commonalities drawn from the survey are that many taks in machine learning can be posed as combinatorial optimization problems where discrete EDAs are useful, like searching for the Bayesian network structure, the clasification tree, the architecture of an artificial neural network, the feature subset, the operators of symbolic regression, the cluster assignment, or the policy of actions in reinforcement learning.Continuous EDAs may help in continuous optimization problems, like parameter estimation in logistic regression, optimal weights in artificial neural networks, or hyperparameters in support vector machines.In our view, most papers on recent EDAs focus on solving optimization problems without a machine learning oriented goal.Thus, the main message of this survey is that the list of machine learning tasks posed as an optimization problem is really long and there is much room for the EDAs to leverage these challenging problems.

B. Further Topics
EDAs based on Bayesian networks might adopt modern learning algorithms [204] providing better data fitting structures (such as semiparametric Bayesian networks [205], [206], in which nodes estimated by kernels coexist with nodes that assume Gaussianity) and better efficiency could result in the improved performance of EDAs.This has been done in the new SPEDA [207], based on semiparametric Bayesian networks, which turned out to be one of the best performing approaches with respect to other state-of-the-art algorithms in continuous optimization.
The most frequent use of simple EDAs as a starting practice is in fact a recent recommendation of the Dagstuhl Report from the EDAs seminar held in May 2022 [208].In the case of unsatisfactory results, restart strategies should be tried and then ultimately more complex interaction models would be pursued.Furthermore, the same level of dependence among the variables is usually maintained during the EDA run.However, smarter EDAs could change these levels depending on the landscape shape that is visited in each generation.Finally, the interpretability of the probabilistic graphical models learned in each EDA generation is not sufficiently exploited.Mapping the dependencies captured by these probabilistic models to the optimization problem structure might reveal unknown information about the problem [107].New ideas such as search trajectory networks [209] can be adapted for visualizing and analysing EDA behavior.
There is scarce literature on the use of evolutionary computation (EDAs included) in machine learning for data stream scenarios, a situation that is becoming increasingly important in real-world applications.Another important issue concerns computationally demanding fitness functions.The use of surrogate models to estimate them is common practice [208].Within supervised learning, fitness functions are usually estimated with honest methods (repeated holdout, k-fold crossvalidation, boostrapping, etc.).However, an interval estimation rather than a point estimation would be fairer to evaluate individuals.This would also influence their subsequent selection.Some procedures, when preprocessing the dataset, are amenable to the use of EDAs.We can mention the multivariate imputation of missing values, dimensionality reduction (e.g., nonlinear principal component analysis and multidimensional scaling), and visualization issues (e.g., parallel coordinate plots and optimal graph layouts).
The k-nearest neighbours algorithm could take advantage of EDAs in the prototype selection problem, the number of nearest neighbours, the distance function, and the scheme for weighting the nearest neighbours.There are a number of interesting decisions that can be made for classification trees using EDAs, e.g., finding the best combination of variables in the hyperplane that defines each internal split in oblique trees or improving tree pruning.Rule induction based on the Michigan approach has not been found with EDAs.The kernel function in support vector machines could be used as a hyperparameter to be set by an EDA.
A taxonomy of possible optimization tasks in evolutionary deep learning is as follows [22]: (a) In data processing, the problems are how to generate better data, how to achieve better data balance and how to improve the efficiency of data preprocessing to meet the given requirements; (b) in model search, questions consider how to search efficiently for optimal architectures, optimal hyperparameters, and better models satisfying multiple objectives; (c) in model training, key issues are how to efficiently find optimal parameter values, how to efficiently pre-train models to avoid problems caused by the random initialization of weights and how to reduce the training cost; (d) in model evaluation and utilization, pending tasks include how to better evaluate the robustness of models, how to achieve better model ensembles and how to better prune models.Although many evolutionary computation algorithms have contributed to (a)-(d), EDAs have not, thereby opening opportunities in this popular machine learning area.
From the whole family of discrete Bayesian classifiers (Figure 4), only selective naive Bayes and semi-naive Bayes models have benefited from the use of EDAs.In Bayesian classifiers with continuous predictor variables and even with both continuous and discrete variables, EDAs have not been used.This survey has only found two EDA proposals in improving stacking and boosting, but many ensembles, such as cascading, could benefit from EDA-based searches of the number of classifiers and their type in the chain.Furthermore, likewise other evolutionary algorithms, EDAs could be used to improve the diversity of ensembles of classifiers.
In regression, EDAs can be useful for deciding which interaction terms of variables (and their degree) should be considered for a good fitting in terms of some performance measure as the mean squared error.This extends the simple additive effects in standard regression.Also, multi-output regression [210], where many response variables are to be predicted simultaneously, may take advantage of EDAs for searching the parameters of each regression.
Hierarchical clustering with EDAs could benefit from a more global fitness function that can guide the evolution of dendrograms.[211] addressed this topic (although with a genetic algorithm), where an ultrametric distance associated with the dendrogram that fits the dissimilarity matrix of the dataset was sought.[212] also used a genetic algorithm where individuals represent dendrograms, and the fitness function was recurrently defined according to the concepts of homogeneity and separation.Centroid-based representations of individuals are shorter than label/tree/graph-based representations but are not used in EDAs for partitional clustering.In probabilistic clustering with Gaussian mixture models, the structure of the Bayesian network with a hidden node (Figure 7) sought by the EDA may be turned into a Bayesian multinet (with the hidden node as the distinguished variable) to yield different Gaussians for each cluster.Moreover, EDAs can be used to find the parameters of each mixture component (weight, mean vector and covariance matrix) as an alternative to the structural EM algorithm.All EDA approaches found in partitional and probabilistic clustering work with a number of clusters that are known and this could be relaxed.Selecting features (and even weighting them) tailored to the clusters found in highdimensional spaces is an area where EDA contributions are missing.Clustering based on deep learning could benefit from the use of EDAs.Other more sophisticated variants of clustering, such as spectral clustering, biclustering, multiview clustering, ensemble clustering or multipartition clustering, could be approached with EDAs.
EDAs may be helpful in inference problems with Bayesian networks for association discovery, which have an extreme computational complexity.Examples are most relevant explanations, most frugal explanations, MAP-independence explanations, same-decision probability and counterfactual reasoning.There are also opportunities for applying EDAs in Bayesian network structure learning algorithms in dynamic settings.
Figure 9 includes the main topics for further research.Most machine learning problems are multiobjective in nature.The choice of a suitable set of objective functions is not trivial.In association rule mining, several rule interestingness measures, such as support, confidence, comprehensibility, and lift, may be optimized.Different performance measures such as the F1-measure, area under the ROC curve, sensitivity, and specificity, may be maximized simultaneously in a supervised learning method.In multilabel (or multidimensional) classification, the performance measure of each class variable can be considered an objective rather than the common strategy of averaging them.In addition to performance measures, we can take into account model complexity issues, such as the number of hidden units in artificial neural networks, the depth or length in trees and rules, respectively, or the number of parameters in the support vector machine kernel function.Examples of using complexity (number of features) and accuracy [166]; or calibration and discrimination [143] as objectives were mentioned above (Section V).These are the only multiobjective EDAs for solving a machine learning task that we could find.Using multiobjective clustering is convenient to eliminate prior assumptions about the cluster structure that may not hold in the data.Aditionally, several cluster validity indices or the number of clusters may act as multiple objectives.When learning Bayesian networks, optimizing many objectives, such as maximizing the likelihood, minimizing the number of parameters, and reducing the inference complexity of the network structure, may be the goal.Unlike other evolutionary algorithms, EDAs are usually guided by only one objective, and there is much room to explore.
Finally, the recent trend of automated machine learning (AutoML) aims at automating machine learning techniques to expand their use to anyone (laypersons or experts).Wellknown AutoML methods include Auto-WEKA, Auto-Sklearn, Auto-Keras and irace, which are based on existing machine learning libraries/packages.These methods can be used to optimize the integration of different techniques and their hyperparameters for data preprocessing, feature engineering, and learning processes.EDAs might play a key role in this integration.

Algorithm 1 :
Pseudocode for the EDA.D 0 ← Generate M individuals (initial population) at random repeat for l = 1, 2, ... 1. D Se l−1 ← Select N ≤ M individuals from D l−1 according to the selection method 2. p l (x) = p(x|D Se l−1 ) ← Estimate the probability distribution of an individual x being among the selected population 3. D l ← Sample M individuals (new population) from p l (x) until the stopping criterion is met

Fig. 2 .
Fig.2.Example of a Bayesian network model for the hypothetical risk of dementia, taken from[66].
This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/

Fig. 3 .
Fig. 3. Different row and column orderings for the same table of dimension 18 × 32, illustrating different levels of readability.High readability is associated with small stress values.Taken from [109].
135], a variant of UMDA G c This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/within the probabilistic incremental program evolution framework was applied.

Fig. 4 .
Fig. 4.Classification of discrete Bayesian classifiers[147].ODE: one-dependence estimator; TAN: tree-augmented naive Bayes; SPODE: superparent-one-dependence estimator; k-DB: k-dependence Bayesian classifier; BAN: Bayesian network augmented naive Bayes.The optimization problem associated to Bayesian classifiers consists of searching the best parent variables Pa(C) and Pa(X i ) (the structure of the classifier) and estimating the parameters θ i = p(x i |pa(x i )) and θ c = p(c|pa(c)) (condi- This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105 This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in IEEE Transactions on Evolutionary Computation.This is the author's version which has not been fully edited and content may change prior to final publication.Citation information: DOI 10.1109/TEVC.2023.3314105 This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see https://creativecommons.org/licenses/by/4.0/

TABLE III OPTIMIZATION
PROBLEMS SOLVED WITH EDAS IN BAYESIAN NETWORKS