Stacked Ensemble for Bioactive Molecule Prediction

Bioactive molecular compounds are essential for drug discovery. The biological activity of these compounds needs to be predicted, as it is used to determine their drug-target ability. Because ineffective drugs are discarded after production, wasting resources and time, it is important to predict bioactive molecules with models that have high predictive performance. This study utilizes the stacked ensemble, which uses the predictions of multiple base classifiers as features to train a meta-classifier that makes the final prediction. Using three datasets, DS1, DS2, and DS3, obtained from the MDL Drug Data Report (MDDR) database, the performance of the stacked ensemble was compared to that of three other ensembles (Adaboost, Bagging, and Vote) based on different evaluation criteria and a statistical method, Kendall’s W test. The accuracy of the Stacked ensemble was 96.7002%, 98.2260%, and 94.9007% for the three datasets respectively, although Vote had the best accuracy on dataset DS2, which consists of structurally homogeneous bioactive molecules. Using Kendall’s W test to rank the ensembles, the Stacked ensemble was ranked best on datasets DS1 and DS3, with a mean rank of 4.00 on both and an overall level of agreement, W, of 0.986 and 1.000 respectively. On dataset DS2, it was ranked after Vote and Adaboost, with a mean rank of 2.33 and an overall level of agreement, W, of 0.857. The Stacked ensemble is recommended for the prediction of heterogeneous bioactive molecules during drug discovery and can also be implemented in other research areas.


I. INTRODUCTION
Bioactive molecular compounds are substances with positive effects on living organisms. They have toxicological and pharmacological effects on humans and animals. These compounds can be found in plants, fruits, and nuts, or they can be synthesized. Various compounds exist in a plant, including bioactive compounds. These may be present in any part of the plant, and they need to be extracted for further use [1]. Some bioactive compounds can also be found in marine ecosystems, for example in marine sponges [2], [3]. The nopal cactus, or prickly pear, is another plant that has been used both in traditional medicine and for the properties of its bioactive compounds [4]. Examples of bioactive compounds are flavonoids, carotenoids, and polyphenols.
The associate editor coordinating the review of this manuscript and approving it for publication was Mahmoud Barhamgi.
Bioactive molecular compounds are of high importance in the drug discovery process. It is equally important that the models used to predict these bioactive molecules are accurate enough to avoid errors that leave a drug unable to treat the condition for which it was developed. The introduction of ensembles to prediction has allowed models to combine the strengths of more than one classifier to improve overall performance. Ensembles can be likened to the productivity that results from teamwork compared to working alone.

II. RELATED WORK
The combination of two or more machine learning methods has shown great promise in outperforming a single method. The application of multiple classifiers [5]-[7], otherwise known as an ensemble or the wisdom of crowds, has resulted in better accuracy and performance compared to single classifiers. No algorithm can be said to be the most accurate, and different classifiers are associated with different algorithms, representations, parameters, problems, and training sets, so the problems associated with each classifier are shared across the ensemble. Stacked ensemble, also known as stacked generalization, is an ensemble method introduced by Wolpert [8] in 1992. It has been implemented in some areas of science for prediction, but its implementation is yet to be found in bioactive molecule prediction. He et al. [9] utilized stacked generalization to extract literature related to drug-drug interaction from the large body of available literature. Although the f-score achieved in that study could be considered low, it was a remarkable improvement over the f-scores of previous systems. Anifowose et al. [10] implemented a homogeneous stacked ensemble using Support Vector Machine (SVM) for petroleum reservoir characterization. The stacked ensemble was compared with the bagging method and random forest (RF), both also using support vector machine, and it was reported that in most cases the stacked ensemble had better performance than the other methods and achieved better results than support vector machine alone. Gupta and Thakkar [11] made a comparative study of three metaheuristic algorithms which have been used in optimizing stacked ensemble: genetic algorithm, particle swarm optimization, and ant colony optimization. It was reported that particle swarm optimization can provide better results.
In a previous study [12], the artificial bee colony algorithm was implemented with 10 datasets, and its effectiveness in the selection of the meta-classifier and base classifiers was shown. Stacked ensemble has also been used in analyzing teenage distress [13] and in brain-related studies [14].
The effectiveness of stacked ensemble can also be brought to bioactive molecule prediction. This study implements stacked ensemble and compares it against other ensembles which have previously been used in bioactive molecule prediction.

A. BAGGING
Bootstrap Aggregating, also known as Bagging, is an ensemble method which is capable of handling both classification and regression. It is a parallel ensemble which reduces both variance and overfitting but does not reduce bias.
The bagging method works by creating m bags, each of which contains a subset of the original data. If the original dataset consists of n different instances, n_1 instances of the original data are drawn with replacement into each bag. Replacement implies that the same instance or data point can be chosen more than once within a bag, since the selection is random, and that an instance can appear in several bags. Therefore, each of the m bags contains n_1 instances, all chosen with replacement. As a rule of thumb, n_1 is usually about 60% of n, that is, each bag contains about 60% of the original data.
In this study, n_1 was 100% of n, since it gave a better result than 60%, and the number of bags created, m, was 10. The data in each bag is used to train a model using a classifier (in this study, random forest was used as the base classifier). The output X of each model is then used to determine the overall result Y, using the mean of all the outputs for regression or the mode of all the outputs for classification.
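The sampling and aggregation steps described above can be sketched as follows. This is a minimal illustration in Python rather than the WEKA configuration used in the study: the function names, the `fraction` parameter (the ratio of bag size to n), and the toy data are assumptions introduced here, and the training of a base classifier on each bag is left out.

```python
import random
from collections import Counter

def bootstrap_bags(data, m=10, fraction=1.0, seed=0):
    """Create m bags, each with round(fraction * n) instances drawn with replacement."""
    rng = random.Random(seed)
    bag_size = round(fraction * len(data))
    # rng.choice may return the same instance several times within one bag
    return [[rng.choice(data) for _ in range(bag_size)] for _ in range(m)]

def aggregate_classification(outputs):
    """Overall result for classification: the mode of the per-model outputs."""
    return Counter(outputs).most_common(1)[0][0]

def aggregate_regression(outputs):
    """Overall result for regression: the mean of the per-model outputs."""
    return sum(outputs) / len(outputs)

# Hypothetical toy dataset of 100 instance identifiers
data = list(range(100))
bags = bootstrap_bags(data, m=10, fraction=1.0)  # bag size equal to n, m = 10

# Hypothetical outputs from three models trained on three bags
label = aggregate_classification(["active", "inactive", "active"])
value = aggregate_regression([1.0, 2.0, 3.0])
```

Setting `fraction=0.6` would reproduce the rule-of-thumb bag size mentioned above.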
Bagging has also been used for intrusion detection systems [15], classification of arrhythmia [16], forecasting the flow of traffic [17], and in conjunction with boosting [18], where the two ensembles have shown the capacity to outperform single classifiers.

B. ADABOOST
Adaboost, developed by Yoav Freund and Robert Schapire [19], [20], is the short form of Adaptive Boosting. It is a meta-algorithm and the most commonly known and used form of Boosting. Boosting is a fairly simple variation on the bagging ensemble. It tries to improve learners by focusing on areas where the system is not performing well. Adaboost is a sequential ensemble method. It builds multiple models of the same classifier, each of which learns to fix the prediction errors of the prior model in the chain. The classifiers' outputs are combined into a weighted sum, which denotes the final output of the boosted classifier. This boosting method is called adaptive because subsequent weak learners are tweaked in favor of misclassified instances.
Adaboost weights instances in the dataset by how easy or difficult they are to classify, which allows the algorithm to pay more attention to difficult instances when constructing subsequent models. The wrongly classified instances get a high weight, and each subsequent boosting round learns a new classifier on the weighted dataset. The classifiers are then weighted and combined into a single powerful classifier, with classifiers that have a low training error rate receiving a high weight. The process is halted with cross validation. Adaboost has shown better performance compared to Support Vector Machine, Artificial Neural Network, and K-Nearest Neighbor in the analysis of the Structure-Activity relationship of phenol [21], and has also improved classification time and accuracy in the classification of pecan defects [22].
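The reweighting of instances and classifiers can be illustrated with one round of the classic two-class AdaBoost formulation, with labels encoded as -1 and +1. This is a sketch under that assumption and not necessarily the multi-class variant used in the study; the function names and the four-instance example are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One boosting round; weights are assumed to sum to 1, labels are -1 or +1.

    Returns the classifier weight alpha and the renormalized instance weights.
    """
    # Weighted training error of the current weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Clamp the error to keep the logarithm finite
    err = min(max(err, 1e-10), 1 - 1e-10)
    # A low error rate gives the classifier a high weight
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified instances (t * p == -1) are scaled up, correct ones down
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

def weighted_vote(alphas, predictions):
    """Final output: the sign of the alpha-weighted sum of classifier outputs."""
    score = sum(a * p for a, p in zip(alphas, predictions))
    return 1 if score >= 0 else -1

# Hypothetical example: four equally weighted instances, the second is misclassified
alpha, new_weights = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, -1, -1, -1])
```

After this round the misclassified instance carries half of the total weight, so the next weak learner is pushed to classify it correctly.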

C. VOTE ENSEMBLE
Vote ensemble can be used simply and effectively for classification problems. This method combines multiple classifiers to reduce misclassification. It has demonstrated significant performance in enhancing the accuracy of predictions. It has been applied to a number of areas, and impressive results have been achieved. According to Yan et al. [23], the voting-based method has shown effectiveness in managing incomplete data without making presumptions about missing values. Vote ensemble utilizes several combination rules to make its predictions. These combination rules include: Average Probabilities, Minimum Probabilities, Maximum Probabilities, Product of Probabilities, and Majority Voting [24]. It creates a series of classifiers and then predicts based on either the mode or the mean of the base classifiers' outputs. Majority Voting has been the most widely used rule for prediction, as the output of the ensemble is the label with the highest number of votes from the base classifiers [24]-[28]. It can also be weighted [30], that is, more weight is assigned to classifiers which are more likely to be correct [31].
Choosing classification methods with uncorrelated predictions is essential. It is known that Bayesian, Decision Tree (DT), and K-Nearest Neighbor (K-NN) classifiers should be selected in order to have classifiers with diverse predictions. Vote ensemble has been used for clustering in chemoinformatics [32], [33] and for the detection of mislabelled data in bioinformatics [34]. This study is carried out using majority voting as the combination rule.
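The combination rules listed above can be sketched in a few lines. The example below is an illustrative Python version, not the WEKA implementation; the function names, the `RULES` table, and the class labels and probabilities are hypothetical.

```python
import math
from collections import Counter

def majority_vote(labels):
    """Return the label with the highest number of votes from the base classifiers."""
    return Counter(labels).most_common(1)[0][0]

# Probability-based rules: each maps the per-classifier probabilities
# for one class to a single combined score.
RULES = {
    "average": lambda ps: sum(ps) / len(ps),
    "product": math.prod,
    "minimum": min,
    "maximum": max,
}

def combine_probabilities(prob_dists, rule="average"):
    """Combine per-classifier class distributions and return the top-scoring class."""
    score = RULES[rule]
    combined = {c: score([d[c] for d in prob_dists]) for c in prob_dists[0]}
    return max(combined, key=combined.get)

# Hypothetical class distributions from three base classifiers
dists = [
    {"active": 0.90, "inactive": 0.10},
    {"active": 0.40, "inactive": 0.60},
    {"active": 0.45, "inactive": 0.55},
]
hard_labels = [max(d, key=d.get) for d in dists]
by_majority = majority_vote(hard_labels)             # two of three vote "inactive"
by_average = combine_probabilities(dists, "average")  # one confident vote dominates
```

In this example majority voting and the average rule disagree, which shows that the choice of combination rule can change the final prediction.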

III. EXPERIMENTAL DESIGN
SVM, DT, K-NN, and RF were used as base classifiers for the Stacked ensemble and the Voting ensemble. Random Forest, due to its high performance in classification [35]-[41], was used as the meta-classifier in the Stacked ensemble to enable better prediction, and it was also used as the base classifier for the Adaboost and Bagging ensembles.
The Waikato Environment for Knowledge Analysis (WEKA), a cross-platform machine learning software package written in Java, was used to carry out the analysis.

A. DATASETS
The study was implemented using three datasets obtained from the MDL Drug Data Report (MDDR), which had already been converted to Pipeline Pilot's ECFC_4 fingerprints and folded into 1024-element fingerprints. These datasets have been used extensively in previous chemoinformatics research, such as virtual screening [42]-[44] and the prediction of bioactive molecules [45], [46].
The bioactivity of the molecules was the class on which prediction was made. DS1 consists of 8294 structurally heterogeneous and homogeneous bioactive molecules in 11 classes. DS2 consists of 5083 homogeneous bioactive molecules in 10 classes, and DS3 consists of 8560 heterogeneous bioactive molecules, also in 10 classes. Each dataset was divided into training and testing data in a 70:30 ratio. The three datasets are described and used in [42]-[44].

B. STACKED ENSEMBLE MECHANISM
Stacked ensemble takes the form of a two-level procedure. It involves multiple base classifiers and a meta-classifier. On the first level, several base classifiers are introduced to make predictions. These can be either homogeneous, that is, classifiers of the same type, or heterogeneous, that is, classifiers of several types. In this paper, a heterogeneous set of classifiers consisting of Decision Tree, Support Vector Machine, Random Forest, and K-Nearest Neighbor was used. On the second level, the outputs from the base classifiers are passed to the meta-classifier, Random Forest in this paper, which is trained on these outputs and learns the best way to correct the errors of the base classifiers before generating the final prediction.
It is imperative that the base classifiers produce different, uncorrelated predictions. Better performance is achieved using different algorithms. Basically, stacked ensemble follows this procedure:
1) Several base classifiers are trained using the training data. It is better to select base classifiers with different algorithms.
2) A meta-classifier is then trained based on the predictions from the base classifiers. This meta-classifier learns when the base classifiers are right or wrong.
3) The final prediction is made by the meta-classifier using the base classifiers' predictions for the testing data.
The algorithm for stacking is shown in Algorithm 1 and the mechanism is depicted in Figure 1.
Algorithm 1 Stacking
    ...
        Learn a base classifier s_t based on D
    end for
    Step 2: Construct new data sets from D
    for i ← 1 to m do
        Construct a new dataset that contains {..., s_T(x_i)}
    end for
    ...
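The two-level procedure can be made concrete with a small self-contained sketch. The study used Decision Tree, Support Vector Machine, Random Forest, and K-Nearest Neighbor as base classifiers with a Random Forest meta-classifier in WEKA; the sketch below substitutes deliberately simple stand-ins (a one-feature threshold rule as base learner and a lookup table as meta-learner), and every class name, function name, and data value in it is hypothetical.

```python
from collections import Counter, defaultdict

class ThresholdClassifier:
    """Toy base learner: predicts 1 when one feature exceeds its training mean."""
    def __init__(self, index):
        self.index = index
        self.threshold = 0.0

    def fit(self, X, y):
        self.threshold = sum(row[self.index] for row in X) / len(X)
        return self

    def predict(self, row):
        return 1 if row[self.index] > self.threshold else 0

class LookupMetaClassifier:
    """Toy meta-learner: majority label seen for each tuple of base predictions."""
    def fit(self, Z, y):
        votes = defaultdict(Counter)
        for z, label in zip(Z, y):
            votes[tuple(z)][label] += 1
        self.table = {z: c.most_common(1)[0][0] for z, c in votes.items()}
        self.default = Counter(y).most_common(1)[0][0]  # fallback for unseen tuples
        return self

    def predict(self, z):
        return self.table.get(tuple(z), self.default)

def fit_stack(X, y, base_learners, meta):
    # Level 1: train every base classifier on the training data
    for learner in base_learners:
        learner.fit(X, y)
    # Build the new dataset whose features are the base predictions
    Z = [[s.predict(row) for s in base_learners] for row in X]
    # Level 2: train the meta-classifier on the base predictions
    meta.fit(Z, y)
    return base_learners, meta

def predict_stack(row, base_learners, meta):
    return meta.predict([s.predict(row) for s in base_learners])

# Hypothetical data: the label is 1 only when both features are high
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
bases, meta = fit_stack(X, y, [ThresholdClassifier(0), ThresholdClassifier(1)],
                        LookupMetaClassifier())
stack_predictions = [predict_stack(row, bases, meta) for row in X]
```

Each threshold rule alone misclassifies one of the four instances, while the meta-learner recovers the correct label from the pair of base predictions, which is the error-correcting behavior attributed to the meta-classifier above.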

IV. RESULTS AND DISCUSSION
The performance of three ensembles (Voting, Adaboost, and Bagging) was compared to the performance of the stacking ensemble based on accuracy, sensitivity, specificity, precision, recall, and f-measure. For ensembles requiring more than one base classifier, namely the Stacked and Vote ensembles, Support Vector Machine, k-Nearest Neighbor, Decision Tree, and Random Forest were used as the base classifiers, while Random Forest, being a classifier with high performance, was used as the base classifier for the Adaboost and Bagging ensembles and as the meta-classifier for the stacked ensemble.
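For reference, the evaluation criteria named above can be computed from confusion-matrix counts as sketched below. The sketch treats a single class against the rest; how the study aggregated the criteria across its 10 or 11 classes is not stated, and the function name and the counts are hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Evaluation criteria for one class versus the rest.

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate, identical to recall
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": sensitivity,
        "f_measure": f_measure,
    }

# Hypothetical counts for one activity class
metrics = binary_metrics(tp=90, fp=10, tn=880, fn=20)
```

Under this one-versus-rest definition, sensitivity and recall are the same quantity.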
The performance of each ensemble based on the generally used evaluation criteria is shown in Tables 1, 2, and 3 for datasets DS1, DS2, and DS3 respectively. The ensembles were also ranked using a statistical method, Kendall's Coefficient of Concordance, to ascertain which ensemble had the best performance, and the result is shown in Table 4. The highest results are bolded for emphasis.
... using the DS2 dataset, as shown in Table 2. This dataset contains homogeneous molecules. Although the comparison of the performance across the three datasets shows that DS2 generated ensembles with high prediction performance, it should be noted that its number of instances is lower than that of DS1 and DS3.
Kendall's coefficient of concordance was used to rank the ensembles and to obtain the corresponding level of agreement between the raters for each dataset; the resulting rankings are shown in Table 4. For dataset DS1, the ranking was of the form Stacked > Vote > Bagging > Adaboost, with level of agreement, W, of 0.986 and asymptotic significance, P, of 0.000. Dataset DS2 had the ranking Vote > Adaboost > Stacked > Bagging, with level of agreement, W, of 0.857 and asymptotic significance, P, of 0.001. Dataset DS3 had the ranking Stacked > Vote > Bagging > Adaboost, with level of agreement, W, of 1.000 and asymptotic significance, P, of 0.000.
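As an illustration of how the level of agreement is obtained, Kendall's W for m raters ranking n items without ties is W = 12S / (m^2 (n^3 - n)), where S is the sum of squared deviations of the items' rank sums from their mean. The sketch below applies this formula; the function name and the rank matrices are hypothetical and are not the data behind Table 4, and the tie correction that statistical packages apply is omitted.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance, assuming no tied ranks.

    rankings[r][i] is the rank (1..n) that rater r assigns to item i.
    Returns (W, mean rank of each item).
    """
    m = len(rankings)       # number of raters
    n = len(rankings[0])    # number of ranked items
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((t - mean_sum) ** 2 for t in rank_sums)
    w = 12 * s / (m ** 2 * (n ** 3 - n))
    return w, [t / m for t in rank_sums]

# Item order: Stacked, Vote, Bagging, Adaboost (a higher rank means better).
# Six hypothetical raters, one per evaluation criterion, in full agreement:
full_agreement = [[4, 3, 2, 1] for _ in range(6)]
w_full, mean_ranks_full = kendalls_w(full_agreement)

# The same raters, except that one swaps the two best ensembles:
partial_agreement = [[4, 3, 2, 1] for _ in range(5)] + [[3, 4, 2, 1]]
w_partial, mean_ranks_partial = kendalls_w(partial_agreement)
```

With full agreement W equals 1 and the best item has a mean rank of 4.00, while a single dissenting rater lowers W to about 0.944.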
As shown in Table 4, Kendall's W, that is, the level of agreement, for dataset DS2 is noticeably lower than that of DS1 and DS3. The differences between the mean ranks of the ensembles on DS2 are also not as high as those on the other datasets. This shows that the ensembles performed well in the prediction. These small differences in mean rank make it difficult for the ensembles to be ranked effectively, leading to an agreement level of 0.857, which is still above average. It can be deduced that the more distinct the differences between the mean ranks, the higher the level of agreement between the raters, and the less distinct the differences, the lower the level of agreement. Generally, the models built with dataset DS2 perform better than those built with the other datasets because DS2 contains homogeneous compounds.
The evaluation criteria, that is, accuracy, sensitivity, specificity, precision, recall, and f-measure, were used as raters in ranking the ensembles. The high level of agreement obtained between the raters (the higher Kendall's W is, the higher the level of agreement) shows the effectiveness of the ranking method.
The performance of the ensembles in the prediction of bioactive molecules, together with their ranking using Kendall's coefficient of concordance, shows that Stacked ensemble performs better than the other ensembles in the prediction of heterogeneous molecules. Vote ensemble makes predictions directly from the outputs of the base classifiers, whereas Stacked ensemble passes the outputs of the base classifiers to a meta-classifier. This meta-classifier tries to correct the wrong predictions made by the base classifiers, which in turn strengthens and improves the overall prediction performance.

V. CONCLUSION
Although stacked ensemble performed better than the other ensembles against which it was compared, with accuracies of 96.7002% and 94.9007% for datasets DS1 and DS3 respectively, its performance on dataset DS2, which contains structurally homogeneous bioactive molecules, was lower than that of Vote and Adaboost, with an accuracy of 98.2260%. It is therefore recommended that implementation be tried with other chemical datasets, and that Stacked ensemble be used for the prediction of structurally heterogeneous bioactive molecules. The three datasets implemented in this study, which include both heterogeneous and homogeneous molecules, were all retrieved from the MDL Drug Data Report (MDDR) database. It will be worthwhile to implement the ensemble methods on datasets from other repositories. Further study to improve the performance of the base classifiers, which in turn improves the performance of the ensembles used in predicting bioactive molecules, is also encouraged, since the combination of the classifiers produces the ensemble, which is expected to perform better than the average classifier [47], [48].