Deep Neighbor Information Learning From Evolution Trees for Phylogenetic Likelihood Estimates

Likelihood probability based phylogenetic analysis approaches have contributed to impressive advances in minimizing the variance of estimating the evolutionary parameters. However, their actual applications are greatly limited due to the very time-consuming calculations of Conditional Likelihood Probabilities (CLPs). Accurately and quickly obtaining the likelihoods of massive tree samples can facilitate phylogenetic analysis process. Inspired by recent advance of machine learning techniques that greatly improve the performance of many related prediction tasks, this study proposes a Random Forest (RF) based learning and prediction approach, called NeoPLE. The approach initially learns the deep neighbor information between nodes from the topology representations of evolution trees, integrates likelihood information from these trees, and trains a non-linear prediction model. Instead of having to depend on the recursive calculations of the CLPs of tree nodes, NeoPLE transfers the process to a prediction by the trained model, thus the likelihood estimates become irrelevant with the calculations of CLPs. In terms of performance improvement, speedup factors ranging from 2.1 to 3.5X can be achieved on the analysis of realistic data sets. Moreover, NeoPLE is very suitable to handle the data sets having relatively large number of alignment sites, the factor of up to 27.5X can be achieved on the analysis of simulated data sets. In addition, NeoPLE is robust against a wide range of choices of evolutionary models and is ready to integrate in more phylogenetic inference tools. This study fills in the gaps of phylogenetic analysis using a machine learning approach in feature representation and likelihood prediction of evolution trees, which has not been reported in literatures.


I. INTRODUCTION
Evolutionary research at the molecular level addresses two major issues: rebuilding the evolutionary relationships between species and understanding the dynamics and mechanisms of evolutionary processes [1]- [4]. The former belongs to the field of systematics, and traditionally uses morphological traits and fossils to carry out research [5]- [7]. Because of the wide application and easy access of molecular data, it has become the most commonly used data type in reconstructing phylogenetic relationships of most species groups [2], [8]. The problem of the latter molecular evolution mechanism is The associate editor coordinating the review of this manuscript and approving it for publication was Ioannis Schizas . divided into two aspects. One is to estimate the replacement rate of nucleotides and amino acids. On the other hand, sequence data is used to detect mutation and select models [9], [10].
At present, the genetic sequence data are accumulating explosively. However, due to the great improvement of computer software and hardware, the application of complex statistical methods in solving biological problems has also made great progress, so the two main problems in molecular level evolution research have made amazing progress [11]- [16]. There are indications that this progress, especially in terms of data generation, will certainly continue. In other words, phylogenetic analysis has entered the genomic era, and it has become common practice to analyze large data sets containing hundreds or even thousands of species or sequences. However, phylogenetic analysis needs to construct phylogenetic trees [5], [17]- [19]. At the moment, the commonly used methods for constructing evolutionary trees are distance method [20], maximum parsimony method [21], maximum likelihood method [22], [23] and Bayesian method [24], [28]. The latter two methods harness empirical evolutionary models and have better statistical basis, as the minimum variance of parameter statistics can be obtained when sampling size becomes larger. As long as a reasonable and correct alternative model is adopted, a considerable evolution result can be deduced [29], [30]. Nevertheless, the disadvantages are that both methods need to calculate the Conditional Likelihood Probabilities (CLPs) of tree nodes one by one, while the computational cost is very expensive, resulting in undesirably long execution time.
Machine learning methods are suitable for areas with large data, noise patterns, and lack of unified theory [31]- [33]. The basic idea of machine learning is to learn theory from massive data automatically by reasoning, pattern matching or sample learning [32], [34]. Inspired by recent advance of machine learning techniques that greatly improve the performance of many related prediction tasks, this study proposes a Random Forest (RF) based learning and prediction approach, called NeoPLE. The approach initially learns the deep neighbor information between nodes from the topology representations of evolutionary trees, integrates likelihood information from these trees, and trains a non-linear prediction model. Instead of having to reply on the recursive calculations of CLPs of sampling trees, NeoPLE transfers the process to a prediction by a well-trained model, thus the likelihood estimates become irrelevant with the calculations of CLPs. Besides generating the same consensus trees, the speedup factors ranging from 2.1 to 3.5X can be achieved by NeoPLE on the analysis of realistic data sets comparing with MrBayes -a widespread phylogenetic inference tool. Moreover, NeoPLE is very suitable to handle the data sets having relatively large number of alignment sites, as it makes the time complexity of estimating CLPs be O(1) if only increasing the number of alignment sites of data sets. The speedup factor of up to 27.5X can be achieved on the analysis of simulated data sets containing 54 taxa and 8,000 sites. In addition, NeoPLE is robust against a wide range of choices of evolutionary models, as JC69, K80, HKY and GTR models are successfully established in the approach. This study fills in the gaps of phylogenetic analysis using a machine learning approach in feature representation and likelihood prediction of evolution trees, which has not been reported in literatures.

A. OVERVIEW OF OUR METHODOLOGY
The Bayesian inference of phylogeny employs Markov Chain Monte Carlo (MCMC) process to approximate the posterior probabilities of phylogenetic trees, with the Bayes theorem, prior probabilities, stochastic starting-states, calculations of acceptance probability and others. Tree re-arrangements are harnessed in the process recursively for the purpose of generating new combinations of branches and nodes. First, we analyze the essence of the process through introducing the key procedures in likelihood estimates. Next, we explore the associations between tree nodes, branches and the likelihood by decomposing a sampling tree. We then illustrate how to obtain the evolution time representing the features between arbitrary two leaf nodes. Based on the feature matrix representation of the combinations of nodes and branches, a novel adaptive regressor based on Random Forest (RF) algorithm is then established to predict the likelihood of new combined trees.

B. PHYLOGENETIC LIKELIHOOD ESTIMATE DECOMPOSITION
The MCMC process tries to improve the likelihood probability by constantly modifying the branch length or some other evolution parameters. Fig. 1A illustrates two consecutive trees in a sampling chain. Mi+1 is generated from Mi through rearranging the branch length between node 7 and 9 of Mi. In essence, once a branch length representing the evolution time between nodes is updated, the evolutionary models, such as JC69 [34], [35], [35]- [37], K80 [36], [38], HKY [39] and GTR [40]- [42], generate new transition probabilities to modify the evolution process between the nodes. The likelihood probabilities of the ancestral node, i.e. node 9, are then re-calculated by a likelihood function using these transition probabilities in conjunction with the likelihood probabilities of the observable or non-observable states of the descendant nodes. More precisely, a non-observable state means the state is not certain because multiple possible states existing in nodes of extinction (non-leaf nodes), whereas an observable state means the state is certain which can be observed from input sequences (leaf nodes). Fig. 1B depicts the associations between the components, i.e., branch, node and likelihood, of a tree. The node-branch association denotes the evolution time. The node-node association denotes the evolutionary origin and direction. The root-likelihood association reveals the matching degree of the tree and the given model. As depicted in Fig. 1C, for the node that evolves into two non-leaf nodes, the likelihood probability of a possible state of the node, e.g. a single nucleotide residue 'A', is the product of the probability that 'A' evolving into the left and right descendant nodes. For each evolution branch, the probability of the state is the summation of the probabilities that the state evolves into all possible states of the descendant node. Since the likelihood probability is 100% for the observed state and 0% for the rest, the probability of the state only depends on the transition probability between the state to the observed state if the descendant is a leaf node.
Let X denote the input data, x denote the i-th alignment site of X, τ denote the tree topology of Mi+1, Sk represent a residue state of the node under estimation, Sσ i(k) and Sσ j(k) represent the left and right descendant nodes, P represent the transition probabilities between nucleotide states, θ denote the parameters of the evolutionary model calculating P, a comprehensive pruning algorithm [43]- [45] is applied to FIGURE 1. The procedure of likelihood calculation. A Two consecutive trees in a sampling chain; B The associations between node, branch and likelihood; C A node evolves into two leaf nodes and two non-leaf nodes; D The likelihood estimation flow; E The evolution time between two nodes can be calculated by combining the branch length of all the related nodes; F The characteristic distance matrix is constructed based on the molecular evolution tree.
calculate the likelihood probability of Sk for each arbitrary internal node: where P(v, θ ) represents the transition probabilities between ancestral and descendant nodes over a branch length of v under the specified evolutionary parameters θ . Let π denote the prior base probabilities of residues, the overall likelihood of the tree for site x is then obtained by (2).
Fig. 1D depicts the likelihood estimation flow. The likelihood probability of the entire data X is the product of the probability of each site, hence the log-likelihood ratio is equivalent to the summation of the likelihood probabilities of all sites in X.

C. FEATURE MATRIX PRESENTATION
As above, the calculation of the likelihood probabilities depends on the tree topology and the transition probabilities of the related branches. The former determines the relationships between nodes, while the latter depends on the setting of hypothesis parameters in different evolutionary models, in which, the evolution time are most closely related. Thus, the relationships and branches can be selected as the features representing an evolutionary tree which can be used by machine learning algorithms, e.g. training. Fig. 1E then indicates that the evolution time between two nodes can be obtained by accumulating the branch length of all the related tree nodes. For instance, the evolution time between node 1 and 5 is the summation of t1, t9, t13, t14, t11 and t5. Based on this consideration, the feature matrix of each sampled tree can be constructed as shown in Fig. 1F. The value of elements in the matrix denotes the distance between leaf nodes. To verify whether the likelihood of new tree samples can be predicted by the distance matrix (Fig. 1F), we extract the upper triangular elements of the matrices that are created from 50 sampling trees, and harness Principal Component Analysis (PCA) method to carry out the dimension reduction on the features so that the trend of the reduced features can be compared with the log likelihood of these samples in the Cartesian coordinate system. As illustrated in Fig. 2A, the very similar trend between the curves of the features and the log likelihoods indicates that there are implied connections between the extracted features and the likelihoods of these sampling trees. The scatter plot in Fig. 2B further demonstrates the strong linear relationships between the feature of decreased dimension and likelihood.

D. CART AND BAGGING
The Classification and Regression Tree (CART) is a treebased classifier. Based on minimizing Gini coefficient, the construction of CART starts by searching all attributes in the training samples and selecting the features which can split the whole data set into two portions with the lowest impurity among the samples within each subset. Then, the process is carried out recursively for the rest subsets. The first selected feature is deemed as the root node and the division process continues until the subset only contains a single sample. In order to improve the generalization ability, CART needs to be pruned before it is transformed into final decision rules. The bagging algorithm [43], [44]   classification performance. Let (x1, y1), (x2, y2), · · · , (xn, yn) denote the samples in training data set D, R denote the category of these samples, the impurity function of CART for all samples in R can be calculated by (4). For each split, the optimization target is to search splitting feature and value, i.e. j and sin (5), that can minimize the impurity function L. Fig. 3 illustrates the modeling process on the data set that is transferred from N trees. At first, Bagging is performed on the group of N samples. The bootstrap sampling results in S new groups. These new groups contain equal-size samples, but different features. The number of features in new groups is much less the original group. Next, each group of data set is trained using CART algorithm to get a weak classifier. For instance, the process on a group of 5 samples {x1, x2, x3, x4, x5} and 3 features {f1, f3, f5} is shown in Fig. 9C. For feature f1, the splitting into the group of {x1, x2, x4} and the group of {x3, x5} using a21 can minimize the impurity. Similarly, the splitting value for f3 and f5 are a33 and a45, respectively. By comparing the three impurities, f5 and a45 are selected to be the splitting feature and value. As a consequence, the process makes the left derived node contain the samples of {x1, x2, x4} and the right derived node contain the samples of {x3, x5}.

E. TRAINING AND PREDICTION
Currently, there are many kinds of integration libraries implementing the basic machine learning algorithms. The proposed method focuses on neither presenting the detailed principles of these algorithms nor giving the skills of adjusting the parameters when using them. Instead, off-the-shelf python library implementing RF, i.e., RandomForest-Regressor in sklearn, is used. Fig. 4 depicts the implement-tation flow of prediction. At first, let (X, Y) denote the data set, where X is the data set composed of the upper triangular elements of N square matrices, Y is the data set composed of the likelihood corresponding to N square matrices. The data set (X, Y) is then divided into training set (X_train, Y_train) and testing set (X_test, Y_test). The training set (X_train, Y_train) is input into the Random Forest Regressor for training. Once the training process is completed, the data set in X_test is predicted and recorded as Y_predict.

F. GOODNESS OF FIT
The goodness of fit is a benchmark to measure the fitting degree of the regression line and the observation value. The statistic for measuring the goodness of fit is the determinable coefficient (also known as the deterministic coefficient), and the maximum value of is 1, and the closer the value of is to 1, the more the regression line fits the observed value. On the contrary, the smaller the value of, the worse the regression line fits the observed value. Let denote the value to be fitted, denote the mean value, denote the fitting value, (6) is used to calculate the Regression Square Sum (RSS), (7) is used to calculate the Error Square Sum (ESS). Since TSS is the  The fitting curve between the likelihood probabilities that are calculated by likelihood function and the one predicted by NeoPLE. X-axis represents the number of sampling trees, while Y-axis is the likelihood probabilities corresponding to each sample. The red curve stands for real value and the blue one is the predicted value. It can be seen that the blue curve fits the red curve well.
summation of RSS and ESS, is then calculated by (8). Fig. 5 depicts a fitting curve, where the goodness of fit is 0.9786.

III. RUSULT
The evaluation of NeoPLE is composed of two aspects. The first one is to evaluate the time cost of NeoPLE comparing MrBayes when analyzing different lengths of data sets. The elapsed time of producing the likelihoods of the same number of tree generations by NeoPLE and MrBayes is defined as tp and tc, respectively. The efficiency of NeoPLE is then obtained through comparing tp and tc. The other one is to evaluate the accuracy (acc) of NeoPLE through employing different evolutionary models (model), alignment sites (nsite) and sampling tree generations (ngen).

A. EXPERIMENTAL DATA SETS
The experimental data sets are composed of realistic nucleotide sequence data sets and simulated nucleotide sequence data sets. The former is obtained directly from the package of MrBayes v3.1.2, where data sets of 12, 17, 29, 37, 54 taxa are included, respectively. The latter is generated by Seq-Gen [50] for the purpose of obtaining the data sets with a fixed number of taxa, e.g. the same number of taxa as the realistic data sets, but different number of alignment sites. For analyzing each data set, ngen is determined by the average standard deviation of split frequencies representing the degree of similarity between two independent analyses.
In the experiments, the sampling chains are stopped when the value was less than 0.01. Hence, ngen for these data sets is set to 12000, 20000, 50000, 70000, 400000, respectively. Fig. 6 illustrates the comparisons between tc and tp when analyzing realistic data sets. The nucleotide substitution models are set to GTR, K80, JC69 and HKY, respectively. Apparently, tc is dramatically increased with the increase of the number of taxa, which causes a relatively large consumption of time and hinders the analysis of very large species or sequences to some extent. Precisely, NeoPLE outstrips MrBayes with speedup factors from 2.1 to 3.5 when using GTR model, and the speedup factors are similar for the other three models. Consequently, the experimental results demonstrate that NeoPLE outstrips MrBayes in all cases. Moreover, Fig. 7 illustrates the performance comparisons between tc and tp when fixing the evolutionary model to GTR and the number of taxa to nucle12, nucle17, nucle29, nucle37, and nucle54 respectively. Note that the experiments use the simulated data sets with different number of alignment sites ranging from 1,000 to 8,000. As shown in the figure, tc is significantly increased with the increase of alignment sites. Nonetheless, the result produced by NeoPLE is pleasantly surprised as the time cost is constant no matter how the number of alignment sites changes. For instance, tc is markedly growing with the increase of the nsite when the data set is nucle12 in Fig. 7A, but tp almost remains the same. The more obvious is that tc increases rapidly from 495 to 5,300s when increasing the number of taxa to nucle54 depicted in Fig. 7E, while tp still remains 222s. Under these circumstances, the time gap between tp and tc is amazing.

C. CONSENSUS TREE COMPARISONS
The aim of Phylogenetic analysis is to obtain reasonable molecular evolutionary trees, thus a consensus tree needs to be established at the end of each analysis to verify whether VOLUME 8, 2020 the evolutionary settings are acceptable. The accuracy of the proposed NeoPLE is evaluated using the aforementioned realistic nucleotide sequence data sets and the posteriori probability is used to evaluate the credibility of the topological structure produced by MrBayes and NeoPLE. Fig. 8 illustrates the comparisons on the consensus tree that are produced by NeoPLE and MrBayes, For the sake of simplicity, we only include the consensus tree of nucle54 in the figure.
Experimental results in the figure demonstrates that NeoPLE produces the same phylograms and almost exactly the same clade credibility values.

D. DISCUSSIONS
Beside evaluating the accuracy of NeoPLE by the comparisons of consensus trees, the relationships between model, nsite and acc is included. Fig. 9A illustrates the relationships between model and acc when setting the number of alignment sites to 1000 and the number of generations to the maximum. The curves of different colors represent the different data set.
The results indicate that the evolutionary models have no evident effect on the accuracy of likelihood predictions as the ordinate values of different color curves basically fluctuate between 0.93 and 0.94. Fig. 9B illustrates the relationships between nsite and acc when setting the nucleotide substitution models to GTR and the number of generations to the maximum. Experiments show that the number of alignment sites has obvious effect on acc, as the value of acc increases with the increase of nsite.
The relationship between ngen and acc is illustrated in Fig 10. In order to make the results more comparative, the number of tree generations is gradually increased in each time, in which nucle12, nucle17, nucle29, nucle37, nucle54 is increased by 1000, 2000, 5000, 7000, 40000, respectively. The experimental results demonstrate that the prediction accuracy of the model is greatly improved with the number of tree generations increases. The curves of different colors represent different number of alignment sites. It can be seen that different curves show an upward trend. Therefore, increasing the number of generations can improve the accuracy.
Overall, the cases of deep neighbor information, likelihood and random forest algorithm are employed to train a novel non-linear prediction model, called NeoPLE, for phylogenetic inference in this study, NeoPLE is successfully integrated in MrBayes, which brings performance improvements by two major benefits relative to the traditional approaches based on exhaustive CLP calculations.
First, a new abstract relational schema describing the relationship between tree structure and its likelihood probability transfers the original likelihood estimate of a tree into regression-based prediction so that the likelihood of each individual sampling tree in the process of phylogenetic inference can be obtained directly and quickly, which substantially  reduces the analysis cost. Second, the time cost of completing a phylogenetic inference depends on two key factors, which are the number of taxa and the alignment sites per each taxon. The former determines the sampling size, while the latter adds the burden of time cost calculating the CLPs for sites. Under the new schema, the process of training model, as well as performing a prediction, only relates the number of taxa, which is independent of data set length however. This is due to the efficient and effective feature extraction strategy based on deep neighbor information. The prediction model established using tree branch length between arbitrary tree nodes demonstrates the robust adaptation even if changing FIGURE 9. A The trend of acc changes with the change of nucleotide substitution models. The model of the x-axis represents the nucleotide substitution models, such as GTR, JC69, K80 and HKY. The y-axis is the accuracy of model prediction, which is represented by acc. Curves of Different colors stand for different number of taxa. Although the line represented by nucle54 decreases significantly when the abscissa is 2, for other line segments, the value of acc can hardly change with the model. So, model is not a necessary factor to acc. B The trend of acc changes with the number of alignment sites. The nsite represents the number of alignment sites. In the experiment, we set the number of sites to 500, 1000, 2000, 4000, 8000. The y-axis is the accuracy of prediction, which is represented by acc. It can be seen that the value of acc increased with nsite. When the number of sites between 0 and 2000, the value of acc increases significantly. In addition, the titles represent different taxa, such as nucle12, nucle17, nucle29, nucle37 and nucle54. It can be seen that when the taxa is nucle12, no matter the number of alignment sites is 500, 1000, 2000, 4000 or 8000, the value of acc increases with the increase of ngen. The same is true for other taxa. Moreover, the final value of acc was between 0.92 and 0.96. data set length. The benefit in Fig. 7 illustrates increasing data set length does not incur overhead as the time complexity O(1) is achieved.

IV. CONCLUSION
To conclude, the model-based assessment of likelihood usually leads to undesirably long execution time. To overcome, this study makes the full use of candidate trees and successfully establishes a model that exactly describes the relationships between the likelihood and the extracted features through exploiting the deep neighbor information of each individual tree. Instead of having to depend on the recursive calculations of CLPs node by node in traditional phylogenetic inference tools, the proposed NeoPLE mines knowledge from trees calculated and trains a prediction model, thus the likelihood estimates of the rest trees are directly determined from the model. A significant performance improvement in reducing time complexity demonstrates that NeoPLE is a powerful tool for biological evolutionary research. The use of this method is expected to accelerate the progress of phylogenetic research and help uncover the secrets behind complex biodiversity. Future work on NeoPLE will comprise the achievement of performance and accuracy improvements, as well as the integration of more phylogenetic inference tools.