Parallel MOEA Based on Consensus and Membrane Structure for Inferring Phylogenetic Reconstruction

In recent years, phylogenetic inference has attracted considerable attention in both the academic community and various application fields. A phylogeny describes a set of evolutionary relationships, which can be represented as a phylogenetic tree. Phylogenetic reconstruction can be defined as an optimization problem that aims to find the most suitable tree among all possible topologies according to a selected criterion. Since the number of possible topologies grows combinatorially and cannot be enumerated in practice, various heuristic and metaheuristic methods have been proposed to find approximate solutions under the selected criterion. However, different criteria are based on different principles and often conflict with each other. Scholars have therefore proposed multi-objective evolutionary algorithms (MOEAs) that combine several criteria. Nevertheless, MOEAs suffer from long running times because of their computational complexity and slow convergence. By studying the independence between the sub-populations in each time-consuming step of an MOEA, the steps that need no global information can be executed in parallel, which addresses the computational cost directly. Effective parallel algorithms designed around the characteristics of modern multicore clusters can solve such problems. In this sense, we propose a parallelized multi-objective evolutionary algorithm (MOEA-MC) deployed on Spark, which adds consensus to the evolutionary algorithm to improve the quality of convergence and uses a membrane structure to keep an equal number of solutions under different weights. To assess the performance of the proposal, we compared different methods on three real-world datasets. The results show that the solutions derived from MOEA-MC are superior to those of traditional methods on all studied datasets, and that the parallelized MOEA-MC obtains a dominant position and the best Pareto frontier while requiring the least runtime.


I. INTRODUCTION
Biological research has attracted growing attention from scholars with the explosive growth of published genomic data over the past few decades. In particular, phylogenetic reconstruction is one of the main research areas of bioinformatics. Phylogenetic inference yields a series of evolutionary relationships, which are usually represented as a phylogenetic tree. Phylogenetic reconstruction can be used to describe the evolutionary relationships between molecules, which supports research in biomedicine, genetic prediction, and economic crops. For example, Zhang [1] built phylogenetic trees of Arabidopsis and rice AT-hook proteins and found that AT-hook genes can be divided into five subfamilies with similar structures and characteristics. That publication shows the evolutionary relationships among different organisms, which can help predict the function of rice genes. In addition, the next-generation sequencing revolution has brought unprecedented growth in phylogenetic data sets. Phylogenetic reconstruction aims to build a phylogenetic tree that explains the evolutionary relationships among the sequences in a given biological sequence file. Many bioinformatics problems involve complex optimization, and biologists are committed to finding accurate explanations based on biological principles. This has motivated the design of effective algorithms that meet current requirements. In this sense, using bio-inspired metaheuristics to overcome computational challenges [2] has become increasingly popular.

The associate editor coordinating the review of this manuscript and approving it for publication was Quan Zou.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Phylogenetic reconstruction based on evolutionary and bio-inspired algorithms can be cast as an optimization problem that seeks the best topology among all possible trees according to a selected objective function or criterion. Huelsenbeck [3] explained that trees reconstructed under different criteria may conflict with one another even when they are built from the same input. Rokas et al. [4] also pointed out that the choice of criterion has a great influence on the final results. Accordingly, Handl et al. [5] recommended the application of multi-objective optimization. As mentioned in [6], multi-objective optimization has the following advantages over the single-objective approach: 1) it reduces the number of local optima and the probability of stagnation in gradient-free regions; 2) it reduces the impact of noise in the data; 3) it incorporates multiple, mutually conflicting sources of information, so that several criteria can be satisfied concurrently. Therefore, formulating phylogenetic inference as a multi-objective optimization problem (MOP) [7], [8] has become a mainstream approach, and its development is promising for biologists. Trees generated by solving an MOP are not only supported by different biological principles but also have high-quality topologies from the perspective of each objective function. This new perspective increases the complexity of evolutionary inference, which has inspired researchers to conduct original research based on heuristic algorithms [9]. Since the emergence of multi-objective evolutionary algorithms (MOEAs), problems involving complex and diverse optimization have been transformed into finding accurate solutions according to several biological principles [10].
According to their selection mechanisms, MOEAs can be divided into three categories: aggregation functions, population-based approaches, and Pareto-based approaches. Most current work is Pareto-based, and a Pareto-based multi-objective evolutionary algorithm proceeds as follows. First, an initial population P is generated, and an evolutionary algorithm (such as a genetic algorithm) applies evolutionary operations (such as crossover, mutation, and selection) to P to obtain a new population R. Then the non-dominated set NDSet of P ∪ R is constructed. If the size of NDSet is greater or less than the preset size N, it must be adjusted; NDSet must also meet the distribution requirement. If the termination condition is met, the algorithm ends; otherwise, the individuals in NDSet are copied to P and the next round of evolution begins. Pareto-based approaches rely heavily on the choice of sharing parameters and generate high selection pressure, which leads to premature convergence. In addition, each iteration needs to calculate the fitness values of all individuals in the current population, which increases the execution time. The two key issues in implementing a Pareto-based MOEA are: 1) how to make the population move towards the Pareto frontier as quickly as possible, that is, the convergence of the population; 2) how to obtain non-inferior solutions that are uniformly distributed along the Pareto frontier, that is, the diversity of the population. For example, NSGA-II [11] introduced fast non-dominated sorting and uses the crowding distance to measure the distribution of solutions and guide selection, but the crowding distance is costly to calculate. In addition, the computational complexity of NSGA-II is too high for high-dimensional multi-objective problems.
MOEA/D [12] converts a multi-objective optimization problem into multiple scalar sub-problems, each of which is associated with one of a set of uniformly distributed weight vectors. Once a new solution is generated, the solutions of neighboring sub-problems are replaced according to the aggregation function. However, weight vectors that are evenly distributed on the unit hyperplane cannot guarantee a uniform distribution of the final solutions. These inherent properties cause the following defects. First, the difficulty of evaluating the objective functions greatly extends the execution time. Second, the convergence of the evolutionary algorithm is relatively poor, and the quality of the optimal solutions is low. There are two ways to shorten the running time of the algorithm: designing a parallel algorithm, and creating an efficient algorithm that converges in fewer generations. Our research covers both.
In this paper, we propose a multi-objective heuristic based on consensus and membrane structure, called MOEA-MC, to infer phylogenies under the principles of parsimony and likelihood. Our work emphasizes achieving parallelism and convergence simultaneously: the algorithm is parallelized by deploying it on Spark, and good convergence is achieved by adding consensus to each subpopulation of the evolutionary algorithm. Additionally, to ensure that each worker node is assigned an equal number of trees, we use a membrane structure to limit the number of trees in each subpopulation. The membrane structure also restricts the communication frequency between phylogenetic trees under different weights. We compare MOEA-MC with other biological methods on three nucleotide datasets and perform a multi-objective assessment of biological properties using several quality indicators and statistical tests. Finally, the rationality of the algorithm design is verified by comparison with other methods in the literature. The main contributions of this work can be summarized as follows: 1) To develop an effective parallel design, we analyze the working process of multi-objective evolutionary algorithms and identify computationally intensive operations that do not require global information. 2) We discuss the main factors that slow down the convergence of the algorithm and incorporate consensus to preserve topology and accelerate convergence. In addition, a membrane structure is added to each worker node to ensure an equal number of solutions under different weights and to control the communication frequency between parallel sub-nodes. The rest of this paper is arranged as follows. Section 2 introduces the materials and methods involved in MOEA-MC. Section 3 analyzes how to combine consensus and membrane structure in a multi-objective evolutionary algorithm and gives the pseudo-code of MOEA-MC. The parallel MOEA-MC is presented in Section 4. The experimental results are discussed in Section 5. Finally, Section 6 summarizes our work and outlines future work.

II. RELATED WORKS
In this section, we describe the intuition and technical details of phylogenetic reconstruction, discuss why phylogenetic reconstruction is NP-hard [2], [13], and explore how to solve this NP-hard problem [14].
The diversity of creatures in nature reflects the diversity of evolutionary patterns, leading to different representations of species. Explaining this evolutionary process is the goal of evolutionary biologists. Analysis of biomolecular data can account for the mutation and substitution events observed at the nucleotide level, which are the source of evolutionary diversity. In phylogenetic analysis, an N × M alignment of molecular sequences (N is the number of organisms, each with M features or sites) is processed to reconstruct a hypothesis of the evolutionary events related to these sequences. The evolutionary relationship is modeled by inferring a phylogenetic tree T = (V, E), where the branch set E specifies the ancestor relationships between the organisms in the node set V. In evolutionary biology, the leaf nodes of a phylogenetic tree can be species, biomolecular sequences, or other biological entities; in this paper, the leaf nodes are all biomolecular sequences, such as gene or protein sequences. The taxa on the leaf nodes are collectively called Operational Taxonomic Units (OTUs). Correspondingly, the internal nodes are called Hypothetical Taxonomic Units (HTUs), which represent the possible ancestors of the leaf nodes [15]. The relative distance between objects represents their evolutionary closeness. The longer the branch, the more likely mutations are to have occurred along it.
Accordingly, the purpose of phylogenetic tree reconstruction is to find the phylogeny T = (V, E) that meets certain biological quality standards. The leaf nodes of the evolutionary tree are biological objects, the branch lengths indicate the kinship distance among leaves, and the topology describes the evolutionary relationships of these objects. Evolutionary trees can be divided into rooted and unrooted trees according to whether they represent the evolutionary order. The root of a rooted tree is the most recent common ancestor of all leaf nodes, and the direction of evolution is from root to leaf. An unrooted tree has no root node and cannot represent the evolutionary order between nodes. When reconstructing the possible evolutionary trees from a sequence file with n objects, the number of unrooted trees U(n) and rooted trees R(n) can be computed as follows [16]:

$$U(n) = (2n-5)!! = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}, \quad n \ge 3$$

$$R(n) = (2n-3)!! = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}, \quad n \ge 2$$

As the number of species grows, the number of candidate trees explodes, and the reconstruction of phylogenetic trees (whether rooted or unrooted) becomes an NP-hard problem. For example, given a sequence file with 50 objects, there are 2.84 × 10^74 unrooted trees and 2.75 × 10^76 rooted trees.
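The growth of the search space can be checked numerically. The following is a minimal sketch, assuming the standard double-factorial counts for fully resolved binary trees, (2n-5)!! unrooted and (2n-3)!! rooted; the function names are illustrative and not part of the original work.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n: int) -> int:
    """Assumed count of unrooted binary trees on n >= 3 leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("an unrooted binary tree needs at least 3 leaves")
    return double_factorial(2 * n - 5)


def count_rooted(n: int) -> int:
    """Assumed count of rooted binary trees on n >= 2 leaves: (2n-3)!!."""
    if n < 2:
        raise ValueError("a rooted binary tree needs at least 2 leaves")
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in (4, 10, 50):
        print(n, f"{count_unrooted(n):.2e}", f"{count_rooted(n):.2e}")
```

Python integers have arbitrary precision, so the counts are exact; the scientific-notation formatting is only for display.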

A. OBJECTIVE FUNCTION
Reconstruction can be divided into four basic steps. First, obtain the biomolecular sequences. Thanks to the development of sequencing technology, these can be obtained from major gene banks or biological information databases, such as GenBank and the European Molecular Biology Laboratory (EMBL). Second, perform data preprocessing such as site alignment. Third, choose an evolutionary reconstruction model established in biology, namely a hypothesis about the evolutionary laws of species. Finally, reconstruct the phylogenetic tree with an algorithm based on the evolutionary reconstruction model. Originally, all possible evolutionary trees were enumerated for the given sequences and the best one was recommended. With the rapid development of bioinformatics and ever larger sequence sets, it is no longer feasible to enumerate all possible trees. Under an optimality criterion, the huge amount of computation forces the search to rely on heuristic techniques [17], which can find suitable solutions even for large datasets within a reasonable runtime. Of course, for small datasets, exhaustive or exact search techniques can still be considered. The methods for the fourth step can be divided into those based on an optimality principle and those not based on one. The former reconstruct trees with comparable evaluation values, so the best tree can be identified. The latter obtain a phylogenetic tree through a sequence of algorithmic steps, and the resulting trees cannot be compared. The Maximum Parsimony method (MP) [18], [19] and the Maximum Likelihood method (ML) [20] are the two most classic algorithms based on an optimality principle. The latter category is usually represented by distance-based methods, which use the differences between sequences to construct a distance matrix and then reconstruct the evolutionary tree, such as neighbor joining (NJ) [21]. Bayesian Inference (BI) is another commonly used reconstruction method.
Because the former yield exactly comparable values, most multi-objective optimization methods in the literature infer phylogenies by optimizing parsimony and likelihood. In this paper, we treat phylogenetic reconstruction as a bi-objective optimization problem involving two widely used biological objective functions, parsimony and likelihood. The parsimony value is obtained using Fitch's algorithm [22], and the likelihood score is calculated using Felsenstein's algorithm [23].

1) MAXIMUM PARSIMONY
The use of the maximum parsimony method to reconstruct phylogenetic trees was first proposed by Camin (1965) [24] and Hein (1990, 1993) [25]. The principle of the maximum parsimony method is based on Ockham's razor, a philosophical principle that prefers the simpler of competing explanations. In other words, the maximum parsimony method follows the principle of minimal change: the fewer mutation or substitution events the evolutionary process requires, the closer the tree is assumed to be to the truth. Given a dataset of n aligned sequences, each with m features, we can infer a tree T = (V, E). The parsimony calculation needs the ancestral sequence of each internal node to be assigned in advance, which can be done with a bottom-up approach [22]. After assigning the ancestral sequences, the parsimony score P(T) of the tree T is defined as [26]:

$$P(T) = \sum_{(u,v) \in E} \; \sum_{i=1}^{m} C_i(u, v)$$

where u, v ∈ V are linked by the branch (u, v) ∈ E, the inner sum is an integer value that quantifies the observed mutation events between u and v, and C is the cost matrix: C_i(u, v) indicates the difference between u and v at the i-th site and is calculated as follows:

$$C_i(u, v) = \begin{cases} 1, & u_i \neq v_i \\ 0, & \text{otherwise} \end{cases}$$

where u_i and v_i are the state values of the i-th character of u and v. After obtaining the parsimony value of each branch, the next step is to sum over all branches to obtain the parsimony value of the tree.
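The bottom-up assignment of ancestral states and the resulting change count can be illustrated with a small sketch of Fitch's algorithm. It assumes a rooted binary tree written as nested tuples of leaf names and an alignment stored as a dictionary from leaf name to sequence; both representations, and the function names, are illustrative choices rather than the data structures used by the authors.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name or a pair (left_subtree, right_subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (candidate ancestral states, number of changes) for one site."""
    if isinstance(tree, str):
        # Leaf: the observed state, no change needed below it.
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra change at this node.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, alignment: Dict[str, str]) -> int:
    """Sum the per-site Fitch change counts over all aligned sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total


if __name__ == "__main__":
    aln = {"A": "AC", "B": "AC", "C": "GC", "D": "GT"}
    print(parsimony_score((("A", "B"), ("C", "D")), aln))
    print(parsimony_score((("A", "C"), ("B", "D")), aln))
```

On this toy alignment the topology that groups A with B and C with D needs fewer changes than the one that groups A with C, so it would be preferred under the parsimony criterion.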

2) MAXIMUM LIKELIHOOD
The starting point of the maximum likelihood method is the branch length of the phylogenetic tree. There is a positive correlation between the branch lengths of the phylogenetic tree and the evolutionary time between the leaf nodes, and evolutionary time is closely related to the probability of variation. The maximum likelihood method was originally used to estimate the parameters of probability models in statistics. Joseph Felsenstein (1980) first proposed applying the maximum likelihood method to phylogenetic inference. In phylogeny, among the phylogenetic trees reconstructed from a given set of sequences, the one with the largest likelihood value is considered closest to the real phylogenetic tree. Therefore, a likelihood-based reconstruction scheme first builds the candidate phylogenetic trees, calculates the likelihood of each one, and finally takes the tree with the largest likelihood as the optimum. Let D be a collection of n aligned sequences with N characters per sequence, where the characters are drawn from the alphabet Λ = {A, C, G, T}. M is an evolutionary model that describes the evolutionary hypotheses and provides the mutation probabilities at the nucleotide level (such as JC69 [27], HKY85 [28], GTR [29], TN93 [30], K80 [31]). The phylogenetic topology T = (V, E) is a description of the evolutionary hypothesis. The likelihood of T can be calculated as:

$$L(T) = \prod_{j=1}^{N} L_j(T)$$

where L_j(T) = P(D_j | T, M) is the likelihood at site j, given by:

$$L_j(T) = \sum_{r_j \in \Lambda} \pi_{r_j} \, C_j(r_j, r)$$

where π_{r_j} is the stationary probability of state r_j at the root r ∈ V, C_j(r_j, r) is the partial conditional likelihood at site j of the subtree rooted at node r, and r_j ranges over all possible states in Λ at site j. Let r ∈ V be an HTU with descendants u and v; then C_j(r_j, r) is calculated as:

$$C_j(r_j, r) = \left[ \sum_{u_j \in \Lambda} C_j(u_j, u)\, P(r_j, u_j, t_{ru}) \right] \times \left[ \sum_{v_j \in \Lambda} C_j(v_j, v)\, P(r_j, v_j, t_{rv}) \right]$$

where u_j and v_j are the character states of nodes u and v at site j, and t_ru and t_rv are the lengths of the branches (r, u) ∈ E and (r, v) ∈ E that connect u and v to r. P(r_j, u_j, t_ru) is the probability that state r_j at node r changes to state u_j at node u during the evolutionary time t_ru, and P(r_j, v_j, t_rv) is defined analogously. The values of P(r_j, u_j, t_ru) and P(r_j, v_j, t_rv) are provided by the evolutionary model M.
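The recursive computation of the partial conditional likelihoods can be sketched as follows. This is only an illustration under stated assumptions: the evolutionary model is fixed to JC69 (equal stationary probabilities of 1/4, branch lengths in expected substitutions per site), the tree is a nested tuple in which every child is paired with its branch length, and all names are hypothetical.

```python
import math
from typing import Dict, Union

STATES = "ACGT"

# A tree is a leaf name, or ((left_subtree, t_left), (right_subtree, t_right)).
Tree = Union[str, tuple]


def jc69(a: str, b: str, t: float) -> float:
    """JC69 probability that state a is replaced by state b along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial(tree: Tree, site: Dict[str, str]) -> Dict[str, float]:
    """Partial conditional likelihood of the subtree for every state at its root."""
    if isinstance(tree, str):
        # Leaf: likelihood 1 for the observed state, 0 for the others.
        return {s: 1.0 if s == site[tree] else 0.0 for s in STATES}
    (left, t_left), (right, t_right) = tree
    c_left, c_right = partial(left, site), partial(right, site)
    return {
        s: sum(c_left[x] * jc69(s, x, t_left) for x in STATES)
        * sum(c_right[y] * jc69(s, y, t_right) for y in STATES)
        for s in STATES
    }


def site_likelihood(tree: Tree, site: Dict[str, str]) -> float:
    """Weight the root partial likelihoods by the JC69 stationary probabilities (1/4 each)."""
    root = partial(tree, site)
    return sum(0.25 * root[s] for s in STATES)


def log_likelihood(tree: Tree, alignment: Dict[str, str]) -> float:
    """Sum of per-site log-likelihoods, which avoids underflow of the raw product."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(tree, {k: v[i] for k, v in alignment.items()}))
        for i in range(n_sites)
    )


if __name__ == "__main__":
    tree = ((("A", 0.1), ("B", 0.1)), 0.05), ("C", 0.2)
    print(log_likelihood(tree, {"A": "ACGT", "B": "ACGA", "C": "ATGT"}))
```

Working with the logarithm of the likelihood is a common design choice because the product over many sites quickly becomes too small for floating-point numbers.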

B. MULTI-OBJECTIVE EVOLUTIONARY ALGORITHM
A single-objective optimization problem considers only the maximization (or minimization) of one objective function. In contrast, multi-objective optimization problems involve multiple objectives, which usually conflict with each other. The application of multi-objective optimization in phylogenetics is a promising way to deal with the main sources of inconsistency that may affect the reliability of phylogenetic inference. According to [32], studies of phylogenetic reconstruction can be divided into two groups. On the one hand, a series of multi-objective evolutionary algorithms have been proposed to resolve conflicting information in different data sets. On the other hand, other studies focus on resolving the inconsistencies that arise when phylogenetic analysis uses different optimality criteria. The most debated of these is the conflict between parsimony and likelihood. Studies [33], [34] have shown that these two criteria may lead to conflicting evolutionary hypotheses.
Hence there is a need to address the potential conflict between different optimality criteria [34], which has become the main source of inconsistency in phylogenetic research. One way to address this issue is to introduce a multi-objective formulation of the problem. Real-world problems are often composed of multiple goals or evaluation indexes that conflict with and affect each other. When more than one objective must be optimized simultaneously, the problem is called a multi-objective optimization problem (MOP) [7], [8], which can be formulated as follows:

$$\text{maximize } F(x) = (f_1(x), f_2(x), \ldots, f_m(x))^{T}, \quad \text{subject to } x \in \Omega$$

where Ω is the search domain, x is the decision variable, m is the number of objective functions, and F : Ω → R^m, where R^m denotes the objective space [35]. When m = 1, the problem is a single-objective optimization problem; when m ≥ 2, it is a multi-objective optimization problem. In general, an MOP has multiple objectives or evaluation criteria that constrain one another: improving one objective comes at the cost of degrading others. Consequently, a multi-objective optimization problem generally does not have a single optimal solution but a set of approximately optimal compromise solutions. A traditional optimization algorithm obtains only one compromise solution per run, so its efficiency on multi-objective optimization problems is too low to meet practical requirements. An evolutionary algorithm (EA) takes the population as the unit of evolution and can obtain a set of approximately optimal solutions in a single run [36]. Multiple individuals in an EA evolve at the same time, which reduces the influence of any single individual and thus the probability of falling into a local-optimum "trap" [37]. At present, many multi-objective evolutionary algorithms have been proposed, such as the representative dominance-based NSGA-II [11], the decomposition-based MOEA/D [12], and PhyloMOEA [38]. These classic algorithms have performed well in this field and usually serve as references when new methods for solving MOPs are proposed [39].
In multi-objective optimization, there is usually no feasible solution that optimizes all objective functions at the same time. In other words, a solution generally cannot be improved in one objective without degrading another. Therefore, our goal is to search for the Pareto-optimal solutions, that is, the solutions that no other solution dominates. For example, let f with different subscripts denote different objective functions to be maximized, and let x_1 and x_2 be two solutions in the search domain Ω; the feasible solution x_1 is dominated by x_2 if:

$$\forall i \in \{1, \ldots, m\}: f_i(x_2) \ge f_i(x_1) \quad \text{and} \quad \exists j \in \{1, \ldots, m\}: f_j(x_2) > f_j(x_1)$$

The points in the objective space corresponding to the Pareto-optimal solutions are non-dominated, and together they form the Pareto frontier.
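The dominance relation and the extraction of the non-dominated set can be sketched in a few lines. The sketch assumes that every objective is maximized, so a minimized objective such as the parsimony score is negated before comparison; the function names and the toy values are illustrative only.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a dominates b when every objective is to be maximized."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the points that no other point dominates (quadratic-time scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Each tuple is (negated parsimony score, log-likelihood) of a hypothetical tree.
    trees = [(-10, -100.0), (-12, -90.0), (-12, -110.0), (-11, -95.0)]
    print(non_dominated(trees))
```

The pairwise scan is quadratic in the population size, which is acceptable for a sketch; fast non-dominated sorting as used in NSGA-II organizes the same comparisons into ranked fronts.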

C. CONSENSUS
The concept of consensus is introduced in [40]: a consensus tree summarizes the topological features of multiple trees and integrates them into a single tree. Consensus trees can be divided into several categories according to the integration method, such as the strict consensus tree, majority-rule consensus tree, loose consensus tree, and greedy consensus tree [40]. MOEA-RC [41] uses the majority-rule consensus to retain branch features during evolution and has demonstrated that consensus can help MOEAs converge in fewer generations.
Our work also adopts the majority-rule consensus. As described for MOEA/D [12], neighboring sub-problems are likely to have similar search directions, so the number of solutions used to calculate the consensus should be greater than 2. On the other hand, if all solutions are used to calculate the consensus, the results become completely homogeneous; the solution set then contains few elites and the correct consensus may be lost. We therefore chose 3 as a suitable number, which reduces computation while ensuring the reliability of the consensus. In our work, consensus accelerates convergence by acting on crossover and mutation. The consensus branches under different weights are regarded as the correct branches in the current population, so the evolutionary algorithm protects the consensus topology during crossover and mutation. This retention reduces the overall execution time of the evolutionary algorithm and speeds up the search.
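The majority-rule idea can be made concrete with a small sketch. It assumes rooted binary trees written as nested tuples of leaf names and represents each tree by its set of clades (the leaf sets below internal nodes); a clade is kept when it occurs in more than half of the input trees. Unrooted trees would use bipartitions instead, and the names below are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, List, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the leaf set below every internal node except the root."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = leaves(node[0]) | leaves(node[1])
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found


def majority_rule(trees: List[Tree]) -> Set[FrozenSet[str]]:
    """Clades that appear in strictly more than half of the trees."""
    counts = Counter(c for t in trees for c in clades(t))
    return {c for c, k in counts.items() if k > len(trees) / 2}


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))
    t3 = (("A", "B"), (("C", "D"), "E"))
    for clade in sorted(majority_rule([t1, t2, t3]), key=sorted):
        print(sorted(clade))
```

With three input trees, as chosen in the text, a clade is retained as soon as two trees agree on it; these retained clades are the branches that crossover and mutation would then be asked to preserve.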

D. MEMBRANE STRUCTURE
In 2007, Zhang [12] proposed MOEA/D, a multi-objective evolutionary algorithm based on decomposition. However, Zhang [42] found that the resulting Pareto front lacked diversity. Studies of MOEA/D found that only some solutions are selected across the sub-problems, and many sub-problems may correspond to the same non-dominated solution, which leads to a loss of solution diversity. To solve this problem, Zhang [42] designed a multi-objective evolutionary algorithm that incorporates a membrane structure to reduce the number of sub-problems and increase the probability that each sub-problem has a different solution. In biology, membranes play a vital role in the structure and function of living cells. Here, the membrane structure refers to the structure of the membrane computing model, and it helps ensure that a sub-problem has multiple solutions. Membrane computing is a branch of natural computing: a computational model inspired by the structure and function of cells and of the tissues or organs they form. In the ten years since the concept of "membrane computing" was put forward, its theory, models, algorithms, and applications have developed rapidly. Membrane computing provides new distributed parallel information processing methods and technologies for computer science, promotes the development of new high-performance computing technologies, and offers a new way to solve computationally difficult problems.
Evolution within a membrane eliminates the solutions with the worst performance. Therefore, each sub-problem retains relatively more good solutions to choose from. After multiple iterations, the solutions in each membrane are considered the best solutions to the sub-problem of that membrane. Moreover, evolutionary algorithms lend themselves to parallelization, as shown by parallel genetic algorithms (PGAs) [43]. The membrane structure divides all current individuals into multiple subpopulations of a specified size and carries out the evolution inside each one. Through the evolution of the subpopulation within each membrane, local optimal solutions are found, and exchanges between adjacent membranes are used to seek the global optimum. Similar to a biological membrane, which encloses a space whose interior can maintain a biochemical environment different from the outside world, each subpopulation is regarded as a cell with its own membrane, which restricts the number of trees in one "cell" and limits when the best and worst solutions are exchanged and the optimal solutions updated. The specific implementation steps are:
1) Initialization: divide the objective space into multiple membrane structures and initialize the solutions of each membrane.
2) Evolution: each subpopulation independently and concurrently performs genetic operations and evaluates its individuals.
3) Exchange: a timer set by the membrane structure determines whether the iteration meets the exchange requirement. If so, the worst solutions are replaced with the best solutions of the neighboring subpopulation through the membrane.
4) Iterate through the second and third steps until a suitable individual is found or the specified number of iterations is completed.
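The steps above can be sketched as a simple loop. The sketch makes several assumptions that the text does not specify: membranes are arranged in a ring, each membrane receives migrants from its left neighbor, plain fitness numbers stand in for phylogenetic trees (higher is better), and `evolve` is a hypothetical callback that performs the genetic operations inside one membrane.

```python
from typing import Callable, List

Membrane = List[float]  # fitness values stand in for trees; higher is better


def migrate(membranes: List[Membrane], k: int = 1) -> List[Membrane]:
    """Replace the k worst members of each membrane (k smaller than its size)
    with copies of the k best members of its left neighbor on the ring."""
    snapshot = [sorted(m, reverse=True) for m in membranes]  # best first
    result = []
    for i, current in enumerate(snapshot):
        donors = snapshot[i - 1][:k]  # index -1 wraps around the ring
        survivors = current[: len(current) - k]
        result.append(sorted(survivors + donors, reverse=True))
    return result


def run(
    membranes: List[Membrane],
    iterations: int,
    exchange_interval: int,
    evolve: Callable[[Membrane], Membrane],
) -> List[Membrane]:
    """Evolve every membrane independently and exchange on a fixed timer."""
    for it in range(1, iterations + 1):
        membranes = [evolve(m) for m in membranes]  # no global information needed
        if it % exchange_interval == 0:
            membranes = migrate(membranes)
    return membranes


if __name__ == "__main__":
    start = [[1.0, 5.0, 3.0], [2.0, 9.0, 4.0], [7.0, 6.0, 8.0]]
    print(run(start, iterations=2, exchange_interval=2, evolve=lambda m: m))
```

Because migration takes a snapshot before replacing anything, the result does not depend on the order in which membranes are processed, and every membrane keeps its original size.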

III. MOEA-MC
Having described the advantages of consensus and membrane structure above, we design a novel MOEA that integrates both. Lemmon [44] concluded that four trees can generate the optimal consensus. Therefore, we divide every membrane into four subpopulations, and the consensus is calculated from the best solutions of the four parts. Each subpopulation develops independently, with its own search direction and consensus, and evolves on its own while the genetic operators of the evolutionary algorithm protect the consensus. Because membranes are independent and therefore well suited to decomposition, we employ the weighted sum method [45] and decompose the multi-objective optimization problem into multiple single-objective optimization problems by their weights [46]. Thus, each membrane corresponds to a weight vector. The more uniform the initial setting of the weight vectors, the better the distribution of the final non-dominated solution set. Based on the above analysis, we adapted the MOEA/D algorithm by integrating consensus and membrane structure to tackle the phylogenetic inference problem. Algorithm 1 shows the pseudo-code of MOEA-MC, where D is a file of aligned biomolecular sequences in PHYLIP format, m is the number of membrane structures, mp and mo are the mutation rate and mutation operator respectively, pc is the crossover probability, ei is the exchange interval, and It is the preset number of search iterations. The following subsections describe the details of the algorithm.

Algorithm 1: MOEA-MC
[...]
16. end while
17. Return P

1) Initialization: Transform file D into N * 4 * S phylogenetic trees by using a rearrangement method, and generate a well-distributed set of weight vectors W_m = {w_1, w_2, ..., w_m}.
2) Calculate MP and ML: Different objective functions take values on different scales. To better compare the quality of a solution across the objective functions, each value is normalized as follows:

$$\bar{f}_i(x) = \frac{f_i(x) - z_i^{*}}{z_i^{nad} - z_i^{*}}$$

where \bar{f}_i is the normalized value of the i-th of the m objective functions, and z^* = (z_1^*, ..., z_m^*) and z^{nad} = (z_1^{nad}, ..., z_m^{nad}) are the best and worst values of the m objective functions.
3) Redistribution and consensus calculation: The fitness value of each solution is then calculated according to the following formula:

$$G^{ws}(x \mid w_i) = \sum_{j=1}^{m} w_i^{j} f_j(x) \qquad (10)$$

where G^{ws}(x | w_i) is the fitness value of solution x under the weight vector w_i, w_i^j is the j-th component of w_i, and f_j is the value of the j-th objective function. The fitness is thus the sum, over all objectives, of the product of each weight and the corresponding objective value. Equation (10) is the weighted sum method, which decomposes the multi-objective optimization problem into n sub-problems, one for each weight vector w. The population is divided into several subpopulations, and the trees in each subpopulation are sorted by fitness value. Following the earlier definition, N majority-rule consensus trees are computed and later broadcast to the worker nodes.
4) Generate descendants: Apply binary tournament selection (binary_tournament_selection) to each subpopulation to pick two phylogenetic trees, then perform crossover and mutation on them to generate descendants.
5) Selection: Merge parents and children, sort them within the membrane, and eliminate the half of the phylogenetic trees with the lowest fitness.
6) Exchange: Check whether the migration condition ei has been reached. If the iteration interval has elapsed, perform the migration operation: the eight worst solutions in the target membrane are replaced by the four best solutions of the adjacent membranes. Otherwise, skip this step.
7) Stop or continue: Check whether the stop condition is met. If so, stop the algorithm; otherwise return to step 2 and continue the execution.
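The normalization, weighted-sum fitness, and in-membrane selection of steps 2, 3, and 5 can be sketched together. The sketch assumes two objectives stored as (parsimony score, log-likelihood), the normalization (f - z*) / (z_nad - z*) so that 0 is the best and 1 the worst value of each objective, and therefore treats a smaller weighted sum as a fitter solution. The evenly spaced weight vectors and all function names are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def normalize(f: Sequence[float], z_best: Sequence[float], z_worst: Sequence[float]) -> List[float]:
    """Scale each objective so that the best value maps to 0 and the worst to 1."""
    return [(fi - b) / (w - b) for fi, b, w in zip(f, z_best, z_worst)]


def weighted_sum(f_norm: Sequence[float], weights: Sequence[float]) -> float:
    """Sum over objectives of weight times normalized objective value."""
    return sum(w * fi for w, fi in zip(weights, f_norm))


def uniform_weights(m: int) -> List[Tuple[float, float]]:
    """m evenly spaced two-objective weight vectors (m >= 2), one per membrane."""
    return [(i / (m - 1), 1 - i / (m - 1)) for i in range(m)]


def select(
    population: List[Sequence[float]],
    weights: Sequence[float],
    z_best: Sequence[float],
    z_worst: Sequence[float],
    keep: int,
) -> List[Sequence[float]]:
    """Sort by weighted sum of normalized objectives and keep the `keep` fittest."""
    def fitness(f: Sequence[float]) -> float:
        return weighted_sum(normalize(f, z_best, z_worst), weights)

    return sorted(population, key=fitness)[:keep]


if __name__ == "__main__":
    # Hypothetical (parsimony, log-likelihood) pairs; parsimony is minimized,
    # log-likelihood is maximized, so "best" and "worst" differ per objective.
    pop = [(100, -1100.0), (120, -1000.0), (110, -1050.0), (105, -1090.0)]
    z_best, z_worst = (100, -1000.0), (120, -1100.0)
    for w in uniform_weights(3):
        print(w, select(pop, w, z_best, z_worst, keep=2))
```

Because the best and worst reference values are given per objective, the same normalization handles a minimized objective and a maximized one without negating either.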

IV. PARALLEL DESIGN
At present, solving computationally demanding optimization problems in bioinformatics relies mainly on combining biologically inspired heuristic algorithms with parallelism. Having detailed the main features of MOEA-MC, a genetic algorithm that effectively overcomes the premature convergence of the standard genetic algorithm and has strong global search ability, we now need a reasonable and efficient parallel framework in which to implement it. Parallel platforms or parallel development kits [47] allow us to exploit division of labor and high-speed communication and to use this architecture efficiently. Popular parallel platforms currently include OpenMP, MPI [48], Hadoop [49]-[52] and Spark [53]-[56]. With the rapid development of computer technology, the coordination between the subtasks of parallel algorithms has been taken over by third-party programs, so developers only need to attend to the parallel mechanism rather than to how the work of the cluster is coordinated. These third-party programs usually take the form of development kits or platforms. Compared with parallel implementations based on development kits [47], platforms such as Hadoop and Spark are simpler to implement and more scalable.
Spark [57], developed by the AMPLab at the University of California, Berkeley, has outstanding features such as high availability, high processing speed and fault tolerance. First, Spark uses an efficient DAG execution engine that can quickly process data streams in memory. Second, Spark integrates easily with other technologies. Spark also has its own resource manager and schedulers, such as the standalone mode, which implements a built-in resource manager and scheduling framework. In addition, whereas Hadoop stores temporary files on the local hard disk, Spark uses memory as temporary storage, which greatly speeds up data processing. Spark's parallel and iterative structure is therefore very suitable for mining biological data and for parallelizing and improving MOEA-MC. We thus design a parallel algorithm based on Spark, because this combination is one of the most effective choices for dividing the computer's CPU cores among multiple working nodes and performing the time-consuming objective function calculations in parallel. The modified and parallelized MOEA-MC is described as follows.
To develop an efficient parallel approach, the first step is to identify the operations that do not require global information. The initialization operation requires the information of the entire sequence file and is therefore not suitable for parallelization. The calculation of fitness values can be parallelized, because the likelihood and parsimony calculations show no dependency between phylogenetic trees. The consensus depends only on the trees inside a membrane, so it can also be computed in parallel. Generating descendants needs only the parents and the corresponding consensus, which are not related to other working nodes, so it can be carried out in parallel as well. Merging the children and the parents into a whole can be executed directly by the shuffle operation in Spark. Whether to exchange the optimal solutions depends on the iteration interval set for the membrane structure, which also controls the diffusion of information between subgroups. This theoretical analysis shows that the most time-consuming operations in MOEA-MC can be executed in parallel. Next, we use 'parallelize' in Spark to create RDDs, the parallel data corresponding to each step, on which the parallel processing operations are defined. In summary, parallelized MOEA-MC is a parallel algorithm well suited to deployment on Spark.
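The independence argument above can be illustrated with a small Python sketch in which each membrane is processed by a function that reads only its own weight vector and trees. So that it runs without a cluster, the sketch uses the standard `concurrent.futures` thread pool as a stand-in for Spark; in PySpark the analogous pattern would be `sc.parallelize(membranes).map(evolve_membrane).collect()`. The `evolve_membrane` function, the tuple layout, and the numbers are hypothetical, and the per-membrane work is reduced to ranking and truncation for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Tree = Tuple[float, float]            # hypothetical stand-in: normalized (MP, -lnL)
Weight = Tuple[float, float]
Membrane = Tuple[Weight, List[Tree]]  # one weight vector plus its local trees


def evolve_membrane(task: Membrane) -> Membrane:
    """Local work for one membrane; it touches no data from other membranes."""
    weight, trees = task
    # Rank by weighted-sum fitness (lower is better) and keep the better half.
    ranked = sorted(trees, key=lambda t: weight[0] * t[0] + weight[1] * t[1])
    return weight, ranked[: max(1, len(ranked) // 2)]


def parallel_step(membranes: List[Membrane], workers: int = 4) -> List[Membrane]:
    """Apply evolve_membrane to every membrane concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evolve_membrane, membranes))
```

Because `evolve_membrane` has no shared state, the same function could be handed to a process pool or to Spark executors unchanged; threads are used here only to keep the example self-contained and do not by themselves speed up CPU-bound work in CPython.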

V. EXPERIMENTS AND ANALYSIS
In this section, we conduct a series of experiments to evaluate the performance of parallel MOEA-MC, and we present and analyze the experimental results in terms of parallel performance and biological quality.

A. CONFIGURATION
For the experiments, we used three real-world biological datasets; the details of the sequences and their sources are shown in TABLE 1. Our experimental platform is a PowerEdge R730 computer with 2.40 GHz processors (32 cores) running Ubuntu 5.4.0-6. The General Time Reversible (GTR) evolutionary model is used to implement the ancestral sequences in advance. In addition, we experimentally compared various parameter settings of the evolutionary algorithm to find the input configuration that best improves solution quality. TABLE 2 lists the common algorithm configurations. The aggregation function used by MOEA/D in our work is Tchebycheff.

B. PARALLEL PERFORMANCE
First, we executed MOEA-MC at different degrees of parallelism to observe the relationship between execution time and parallelism. Fig. 1 shows the runtime of MOEA-MC over 100 iterations on the rbcL_55 dataset. By increasing and decreasing the degree of parallelism on the three datasets, we found that the most suitable parallelism differs between datasets. On ZILLA_500, the best performance is achieved with a degree of parallelism of 24, whereas on mtDNA_186 it is achieved at 32. MOEA-MC therefore runs faster with an appropriate degree of parallelism, which demonstrates the benefit of deploying it on Spark: as parallelism increases, the execution time of MOEA-MC gradually decreases until it levels off.
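One way to pick the most suitable degree of parallelism from such measurements is to compute speedup and parallel efficiency for each setting. The Python sketch below assumes the runtimes are available as a mapping from core count to seconds; the helper names are hypothetical and the timings are made-up placeholders, not values from this paper.

```python
from typing import Dict


def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup: serial runtime divided by parallel runtime."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Parallel efficiency: speedup divided by the number of cores used."""
    return speedup(t_serial, t_parallel) / cores


def best_parallelism(runtimes: Dict[int, float]) -> int:
    """Return the core count with the shortest measured runtime."""
    return min(runtimes, key=runtimes.get)


# Made-up placeholder timings in seconds (NOT results from the paper).
example_runtimes = {1: 100.0, 8: 20.0, 16: 12.5, 24: 10.0, 32: 11.0}
```

In this placeholder data the runtime stops improving beyond 24 cores, which is the kind of plateau described above: past some point, scheduling and communication overhead outweigh the extra cores.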
MOEA-MC was designed for tree reconstruction, so a comparison is needed to judge whether it improves the objective values. To verify its performance, we measured the maximum parsimony (MP) and maximum likelihood (ML) values of MOEA-MC on three real-world datasets and compared them with several classic multi-objective evolutionary algorithms (MOEA/D [12], NSGA-II [11] and PhyloMOEA [38]). The final MP and ML values of MOEA/D [12], NSGA-II [11] and PhyloMOEA [38] were obtained by running experiments on MO-Phylogenetics [59], a tool for inferring phylogenetic trees. TABLE 3 reports the comparison of ML values with these reference algorithms. All maximum likelihood values are multiplied by −1 so that both objectives are minimized under a uniform standard. The results for the other objective, MP, are shown in TABLE 4. The value in each table is the best one obtained over all iterations. The purpose of the MOEA-MC algorithm is to decompose the reconstruction task across multiple workers and calculate the objective functions in parallel. Furthermore, it preserves the consensus to speed up evolution and uses the membrane structure to ensure that the number of solutions in each working node stays balanced. However, these settings alone bring no further gains, because they add more complex calculations while MOEA-MC still runs serially on the machine. The results in TABLE 3 and TABLE 4 show no meaningful difference, which confirms that adding consensus and membrane structure alone cannot achieve better performance, whereas the parallel design can change this. This can also be understood as the reason why parallel algorithms are receiving more and more attention for multi-objective problems. The design concept of MOEA-MC is to shorten the runtime by using modern multi-core cluster technology.
TABLE 5 shows that MOEA/D [12], NSGA-II [11], PhyloMOEA [38] and MOEA-MC have very different execution times on the different datasets. The best results are highlighted in bold. We also annotate the parallelism of MOEA-MC: it performs best with 24 cores on ZILLA_500 and with 32 cores on the other datasets. MOEA-MC is in bold on all datasets. Overall, its execution time is about 50% lower than that of the other classic algorithms. TABLES 3-5 also show the performance of non-parallel MOEA-MC in terms of MP, ML and runtime. The overall performance of non-parallel MOEA-MC is slightly inferior, although its ML value on rbcL_55 is lower than those of NSGA-II and MOEA/D and its MP output is similar to theirs. It can be concluded that non-parallel MOEA-MC is worse than parallelized MOEA-MC on all datasets. All experiments were run in the same environment, and the best run was taken as the final result.

To assess whether combining consensus with an MOEA improves convergence as in [41], we include a comparison with other approaches from the literature. Figs. 2-5 show that MOEA-MC achieves better convergence. Figures 2 and 3 show the changes of MP and ML on the mtDNA_186 dataset as the iterations progress. The convergence of MOEA-MC on ML is better than that of the other algorithms and remains in a dominant position throughout the iteration process. Although its convergence on MP is slightly worse than that of NSGA-II, it keeps pace with NSGA-II as the iterations go on. Figures 4 and 5 show the MP and ML changes of the three algorithms on the rbcL_55 dataset as the iterations progress. The convergence of MOEA-MC is better than that of the other algorithms: although its convergence speed at the beginning is slightly worse than that of NSGA-II, MOEA-MC converges earlier than NSGA-II.
After assessing biological performance, we now focus on verifying the multi-objective performance of the inferred solutions. The main purpose of this section is to check whether the hybrid parallel design degrades the quality of the multi-objective solutions. To evaluate multi-objective performance, we used the widely adopted Pareto front indicator. As described in Section 2.2, there is no single optimal solution for multi-objective problems; the goal of a multi-objective evolutionary algorithm is to find the feasible solutions in the search space and then keep those that are not dominated by any other solution. Multi-objective evolutionary algorithms thus yield the Pareto-optimal solutions, that is, those for which no other solution is better. MOEA-MC is also the method that contributes most to the global Pareto frontier for all datasets. So far, MOEA-MC has been shown to reduce runtime through parallelism and to obtain smaller ML and MP values through its distinctive algorithm components. Moreover, MOEA-MC obtained non-dominated Pareto fronts with only 100 iterations. Most importantly, the proposed MOEA-MC performs best on all indicators, which indicates that applying parallel processing to multi-objective evolutionary algorithms enables faster and more accurate inference of phylogenetic history.
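The notion of dominance used for the Pareto front can be made concrete with a short Python sketch. It assumes every objective is minimized, which matches the convention above of multiplying ML by −1; the function names and the sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]  # one solution's objective values, all minimized


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, in their original order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

This quadratic-time filter is adequate for the small solution sets produced after merging the membranes; larger sets would call for a faster non-dominated sorting scheme.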

VI. CONCLUSION AND FUTURE LINES
In this paper, we have proposed a parallelized multi-objective evolutionary algorithm based on consensus and membrane structure (MOEA-MC). The consensus in each subpopulation preserves the best topologies, so that the evolutionary algorithm converges in a shorter runtime. By studying the independence between the subpopulations in each time-consuming step of the evolutionary algorithm, the steps that need no global information can be designed to execute in parallel, which fundamentally reduces the execution time. To eliminate the imbalance between parallel working nodes, we used the membrane structure to control the number of solutions under different weights. In the parallel design section, we carried out a comparative analysis of the existing parallel approaches and chose Spark as the parallel tool for MOEA-MC. Parallelized MOEA-MC can also control the communication frequency between working nodes by setting the migration interval. With Spark's standalone cluster mode, the degree of parallelism is controlled by setting the number of CPU cores. Speedup analysis on different system sizes allows us to determine the main factors controlling parallel performance and the appropriate parallelism for different datasets.
Moreover, the analysis of the multi-objective results shows that MOEA-MC preserves the search capabilities of the original evolutionary algorithm, giving rise to high-quality sets of Pareto solutions in reduced execution time. Locating the Pareto-optimal solutions obtained in 100 iterations in the objective function graph makes it clear that the Pareto front obtained by MOEA-MC dominates the other solutions. In conclusion, our research shows that applying parallel methods helps to cope with this huge computing challenge.
Although the results shown in this work are promising, there are still important issues to improve in the algorithm; see, e.g., [60]-[63]. As future work, we aim to study new approaches such as machine learning [64]-[66] and deep learning [67]-[69] for phylogeny. We will also address the development of asynchronous algorithms for pure shared-memory environments involving a large number of processing cores, and reduce the shuffle operations in Spark as much as possible.