Efficient Compound Activity Prediction Model Based on Bayesian Regularization Method for Candidate Drugs

In breast cancer diagnoses, antagonist compounds are used to resist α Subtype of Estrogen Receptor (ERα) bioactivity. However, those compounds are difficult to be obtained in the process of drug screening. In this paper, an efficient bioactivity prediction model is proposed with consideration of Absorption, Distribution, Metabolism, Excretion and Toxicity (ADMET) properties as an auxiliary prediction to achieve the minimization of the damage caused to human body whereas maximizing the efficacy of estrogen activity inhibitors. In the proposed prediction model, Pearson correlation analysis and Bayesian regularization algorithm under neural network are specifically used to train and analyze the data of bioactivity. To calculate the loss of neural network, cross-entropy and supervise learning are applied. The results show that the accuracy of our proposed compound prediction model can reach up to 92.2%, 94.3%, and 90.2% for Caco2, CYP3A4, and hERG, respectively. Moreover, the Area Under the Curve (AUC) value of our proposed model is 0.9752, which is 13-14% higher than that of SVM method. In addition, the Matthews Correlation Coefficient (MCC) can reach up to 0.93±0.03, 37-40% higher than that of SVM method. Due to correlation analysis and valid data screening, our model can shorten the prediction time by 15%. Due to its advantages in accuracy and speed of prediction, this research can contribute to the upgrading of diagnosis and treatment of breast cancer.


I. INTRODUCTION
Breast cancer is one of the most common and highly lethiferous cancers in the world, whose development is closely related to Estrogen Receptor [1][2][3] . Research has found that Estrogen Receptors alpha (ERα) is expressed in no more than 10% of normal breast epithelial cells but about 50-80% of breast tumor cells. Based on the feature, antihormone therapy is commonly used by breast cancer patients. It controls estrogen level by regulating estrogen receptor activity [4][5] . Therefore, compound which can resist ERα bioactivity is considered as an important target for treating breast cancer. To save time and cost in drug discovery, compound activity prediction models are usually used to screen potential active compounds. Before building a model, the expected compounds need to possess some basic features, such as good bioactivity, pharmacokinetic properties and nontoxicity in human body which are collectively known as ADMET. No matter how active a compound is, if its ADMET properties do not meet the requirements, for example, if it was difficult to be absorbed by the human body, or the metabolic rate in the body was too fast, or it had biotoxicity, it would be difficult to be made into medicine, thus the ADMET properties need to be accurately evaluated.
Previously, scientists examined molecules of a substance in the laboratory to determine its effects on human bodies or animals. However, it is hard for them to check all the compounds in the database because it is highly timeconsuming and expensive. To solve this problem, virtual screening was created, which is a method that can match known compounds with unknown ones. One of the traditional techniques is similarity search [6] . It is a ligandbased technology, which can discover the similarity between molecules or the substructure of distance molecules, but it was only applied within a limited scope. In recent years, machine learning technology has been widely used in the field of chemical informatics with a lot of methods being developed, such as Naive Bayes (NB), Support Vector Machine (SVM), Single hidden Layer Feedforward Network (SLFN), Logistic Regression (LR), and Deep Neural Networks (DNN) [7][8][9] . When using SVM method, the model trained with large-scale samples often has the defect of slow speed, and is sensitive to missing data, parameters and kernel functions [10] . Hence, the traditional SVM is not suitable for multi-classification. Meanwhile, apart from the low accuracy, too much iterative calculation and tuning need to be consumed for SLFN, leading to a long training time [11] . In this regard, Kudisthalert et al. discovered Extreme Learning Machine (ELM) prediction model, using single-layer feedforward neural network structure. It selected the weight from the training randomly and processed the data through a generalized inverse function, which can produce a function with small training error. Without performing iterative process to train a model, the training time for ELM can be remarkably reduced compared with SLFN. Yet this model is not widely used due to the complexity of function calculation of approximate hidden layer nodes. These models are usually measured by two main parameters, namely Area Under the Curve (AUC) and Matthews Correlation Coefficient (MCC) [12][13][14] .
In this paper, aiming at the high complexity of drug screening, Pearson correlation analysis is used to screen the training data. In dealing with the insufficiency of accuracy of the drug prediction, a quantitative prediction model of compound bioactivity is constructed. A three-layer neural network under Bayesian regularization algorithm is established for effectively training and analyzing the bioactivity data of 1942 compounds. In the meantime, considering the wellbeing of human bodies, a classification prediction for predicting ADMET properties is added to the model, and the loss of neural network model are calculated by cross entropy and supervised learning.

II. Method and Principle
In this part, we have introduced the method and principle of our proposed model. Including noise-filtering neural network, the noise filtering of training data, and the Bayesian Regularization Algorithm.

A. Noise-Filtering Neural Network
To promote the efficient screening of drugs and train a practical drug-active prediction system, we constructed the prediction model shown in Figure 1. In our work, the whole prediction model is simplified and described as two parts, pre-processing and post-processing. The "pre-processing" evaluates the correlation of all variables affecting the prediction results through correlation operation and Bayesian analysis. The dependent variables which have low correlation with the prediction results and constant in all training data are eliminated to filter the noise, reduce the complexity of model training, and improve the accuracy. "Post-processing" refers to that our prediction model based on Bayesian regularization and three-layer neural network is constructed to train the relationship between molecular descriptors and biological activities of known compounds. The number of nodes in the middle layer can first obtain an approximate value through empirical formula, and then obtain an optimal value through adjusting parameters in a certain range. The weights of each layer of neural network nodes are trained by supervised learning. And then the unknown compounds are predicted and compared with the real value to verify the effectiveness and practicability of our proposed model for drug screening.
For Biological Activity (BA) prediction, the BP neural network and Bayesian regularization algorithm are combined to significantly improve the generalization ability, convergence speed and approximation accuracy of BP neural network for the best BA prediction effect. For the simplified ADMET prediction (de facto two classification prediction), the conventional BP neural network is used for classification prediction. In Figure 1, "L1Nn" refers to layer 1 node n, supposing n nodes are in layer 1. Similarly, "L2Nm" represents layer 2 node m, assuming there are m nodes altogether in layer 2. "BA" indicates biological activity, "A, D, M, E, and T" represents five properties to be tested, whose performance can be judged by the output. h m n a    (1) The number of hidden layer nodes has an impact on the performance of neural network. An empirical formula can determine the number of hidden layer nodes, as shown in formula (1), in which h is the number of hidden layer nodes, m is the number of input layer nodes, n is the number of output layer nodes, and a is the adjustment constant between 1 and 10 [15] .
After obtaining the number of hidden layer nodes, the ADMET properties begin to predict, as a parameter to measure the comprehensive index of drugs, simplified as 0 and 1 which means inferior and superior. The steps of predicting this parameter are divided into three steps: normalization processing (formula 2), weight matrix calculation and summation (formula 3) and activation function output (formula 4 and formula 5). The second layer repeats a similar operation of the first layer.
In formula (3), Wij represents the weight between node i and node j, whereas bj acts for the threshold of node j, and xj stands for the output value of each node. The output value of each node should be realized according to the output value of all nodes in the upper layer, the weight of the current node and all nodes in the upper layer, the threshold of the current node and the activation function [16][17] .

B. Noise Filtering of Training Data
Pearson correlation coefficient is often used to calculate the correlation between variables. From a mathematical point of view, it is defined as "the covariance between two vectors is normalized by the product of their standard deviation" [18][19][20] . The covariance between two pairs of vectors is the measurement of their fluctuation trend up and down the mean. That is, it measures whether a pair of vectors tend to be on the same side or opposite to their respective averages [21][22][23] . The covariance is calculated by subtracting the respective mean value from each pair of variables, and then multiplying the two values, as shown in formula (6).
To obtain more reliable numbers, normalized covariance can be obtained by dividing covariance by the product of the standard deviation of two vectors [24][25][26] . As shown in formula (7), the Greek letter ρ is used to express Pearson correlation coefficient. As shown in equation (8), the covariance of two identical vectors is also equal to their variance. Therefore, the maximum value of the covariance between two vectors is equal to the product of their standard deviation, which occurs when the vectors are completely correlated [27] . The correlation coefficient is limited to [-1, 1].
According to the correlation analysis, the Correlation Coefficient function is mainly used to calculate the correlation of variables by Pearson correlation algorithm, as shown in formula (9). Where, C (i, j) is the covariance matrix, R (i, j) is the correlation coefficient matrix, and the correlation coefficient is calculated from the covariance.

C. Bayesian Regularization Algorithm
When the size of the training sample set is certain, the generalization ability of the network is directly related to the scale of the network. If the size of the neural network is much smaller than that of the training sample set, the chance of "over training" would be very small. However, for solving specific problems, it is often difficult to determine the appropriate network size (the number of hidden layer neurons). The regularization method improves the generalization ability of neural network by modifying the training performance function of neural network. In general, the training performance function of neural network is expressed by formula (10) [28][29] .
In the regularization method, the mean square sum of network weights is added to the typical objective function, and the network performance function is improved to the formula (11) and formula (12) with ζ1 and ζ2 being the parameters. If ζ1 >> ζ2, the training algorithm minifies the network error. Conversely, if ζ1 << ζ2, the training emphasizes the reduction of weight, leading to a relatively great network error, which eventually makes the output of the neural network smoother and prevents the occurrence of minimum points [30] .
By using the new performance index function, the network is adjusted to a small weight and the effective weight of the network is minimized under the condition of ensuring that the network training error is as small as possible, that is equivalent to automatically reducing the scale of the network. It is difficult to determine the scale coefficient (ζ1 and ζ2) by conventional regularization methods, whereas the Bayesian regularization method adjusts the weight and threshold of the network according to the Levenberg Marquardt optimization theory, that adaptively can adjust the parameters (ζ1 and ζ2) of the objective function in the network training and make it optimal [31] .

III. Results and Discussion
In this part, we have shown the results of correlation analysis, bioactivity prediction, and ADMET properties prediction, respectively. Then, we have discussed these results.

A. Correlation Analysis
For obtaining objective and accurate prediction results, we have downloaded bioactivity data tables and molecular descriptor information tables of 1974 compounds from PubChem database. The bioactivity data training table contains three columns. The first column provides the structural formulas of 1974 compounds, represented by SMILES (Simplified Molecular Input Line Entry System); the second column is the bioactivity value of compounds, who have impacts on ERα; the third column is pIC50, which is usually positively correlated with bioactivity. In the molecular descriptor information table, the first column is the SMILES formula of the compound, followed by 729 columns, and each column represents a molecular descriptor of the compound.
The correlation coefficients between each molecular descriptor variable and bioactivity variable are calculated according to Correlation Coefficient function, with the coefficients displayed from left to right according to the list number, as shown in Figure 2. As shown in the figure, the number of variables is as high as 729, and the amount of data obtained is relatively large. It is impossible to intuitively see the 20 variables with the largest correlation coefficient. The molecular descriptors with the most significant effect on biological activity can be obtained by rows sorting function. Due to correlation analysis and valid data screening, our proposed model can shorten the prediction time by 15%.

B. Prediction of Biological Activity
In this part, the network parameters are set as follows: the number of network epochs is set as 100, the expected error goal as 0.00001, and the learning rate LR as 0.1. After being tried, it is safe to say that when the number of hidden layer nodes is 14, the test accuracy of the test set is the highest at 86%, hence the number of hidden layer nodes is set to 14.
Then the neural network model trained is tried firstly under Levenberg Marquardt (LM) algorithm, the Relevance (R) value is obtained as 0.77761 and Mean-Square Error (MSE) is 0.72564. It is apparent that it cannot meet the expectation in the MSE. For comparing with other training method, the Bayesian algorithm is used to train the neural network. The given 1974 data sets are divided into two parts, with the first 1800 data sets trained by Bayesian method, and the remaining 174 data sets being tested.
As the predicted results shown in Figure 3, to test the error between the real value and the predicted value of 174 molecular descriptor variables in the test set and their error percentage, the red broken line and blue broken line are correspondingly the actual value and predicted value of biological activity of each molecular descriptor. In the figure, the test values and the actual values are showing a similar trend, the error percentage of nearly 93% of the data is controlled within a range of 3%, proving that the established model can make an accurate prediction of the data. Figure 4 shows the error distribution histogram of training set and test set. The blue column and red column represent the number of training sets and test sets with errors within this range respectively. It is not difficult to see that the training error and the test set error are concentrated in [-0.2785, 0.3289], and only a few data errors have absolute values greater than 1.5. Figure 5 is the relationship image of the MSE value of the training set data and the number of test sets with the increase of the number of epochs, in which the red curve and the blue curve respectively are the MSE values of the test data set and the training data set. In the first two epochs, the MSE decreased rapidly from 10 to 1, and from the 81st epoch, the   test MSE and training MSE of the model basically remained at 0.40184. Compared with the MSE value of LM algorithm, the MSE under Bayesian algorithm has been greatly reduced. Figure 6 shows the correlation distribution images of each sample in the training set and the test set on the proposed model. The training set data correlation R is 0.89387, whereas the test set data correlation R is 0.82205, and the overall correlation R reached 0.88167. As a result, the accuracy of our proposed model trained by Bayesian method on the test data set is higher than that of LM algorithm, because the model training under LM algorithm automatically stops when the generalization stops improving, which means it requires less training time but is lack of accuracy. Differently, the model training through Bayesian regularization stops according to adaptive weight minimization, leading to a higher accuracy, especially for the prediction of complex and noisy datasets.

C. Prediction of ADMET Properties
To make prediction of the ADMET properties closer to reality, the ADMET properties tables of 1974 compounds are downloaded from ZINC database. For facilitating modeling, only five ADMET properties of compounds are considered: 1) Caco2, which is used to measure the ability of compounds to be absorbed by human body; 2) CYP3A4, which is the main metabolic enzyme in human body, used to measure the metabolic stability of compounds; 3) hERG, which measures the cardiotoxicity of the compound; 4) HoB, a property can measure the proportion of drugs absorbed into the human blood circulation after it enters the human body; 5) MN, a method to detect whether the compounds have genotoxicity.
According to the empirical formula, when the input port is 20 and the output port is 1, the number of nodes in the middle layer is between 28 and 38, which is a relatively optimal value. Therefore, the numbers of intermediate nodes of the constructed neural network model are tested, and the test accuracy of the five properties of ADMET with the best training data of each node number are plotted. As Figure 7 shows, Caco2, CYP3A4, hERG, HoB, MN can be obtained the best prediction performance of 92.2%, 94.3%, 90.2%, 87.2%, and 92.2%, respectively, when the numbers of hidden nodes are 34, 33, 30, 37, 28, severally. Through appropriate hidden layer nodes, the highest prediction accuracy of the properties of each compound is achieved.
In order to predict the ADMET performance of other unknown compounds, Cross Entropy loss function is used to judge the performance of the neural network model on the samples. It divides 1974 groups of sample data into training set, validation set and test set in the ratio of 14:3:3. The training set is used to train the network and adjust the weights of the internal transmission matrix, the function of the verification set is to find the best number of training epochs, whereas the test set acts as the final performance evaluation   of the neural network model. Taking CYP3A4 as an example, the confusion matrix evaluation diagrams will be obtained as Figure 8, with the green modules indicating correct predictions, and the red modules indicating that the predictions are wrong. Figure 8 represents that in the training data, 334 plus 1011 of the main diagonal are the sum of the correct parts of the training, accounting for 97.3%, whereas the error accounts for 2.7%. With the same method, the sum of correct parts in validation data, test data and summary data are 94.9%, 94.3%, and 96.5%. That is, for the neural network model we trained, through 729 parameters, the accuracy of CYP3A4 data prediction can reach up to 94.3%. It shows that the prediction accuracy of our model is convincingly high.
The Receiver Operating Characteristic (ROC) curve is usually used to measure the performance of the model. The abscissa of the plane is False Positive Rate (FPR), and the ordinate is True Positive Rate (TPR). For a classifier, we can acquire a TPR and FPR point pair according to its performance on the test sample. Generally, the closer ROC curve is to the (0, 1) point, the better the prediction performance of the binary classification model is. The ROC curve can be calculated from the confusion matrix in Figure  8, as shown in Figure 9. The best points on ROC curve of training set, verification set, and test set are (0.054,0.946), (0.075,0.925), (0.05,0.95), respectively. Boltzmann-Enhanced Discrimination of ROC (BEDROC) is a score that is more representative of compound prioritization, since it is biased towards early enrichment.
Another parameter to measure the quality of the model, Matthews Correlation Coefficient (MCC), is the correlation coefficient between observed and predicted binary classifications, with the values staying between [-1,+1]. When the coefficient equals to +1, it indicates perfect prediction, 0 indicates that it is no better than random prediction, and -1 refers to complete inconsistency between prediction and observation. MCC can be calculated directly from the confusion matrix using formula (13). In this formula, TP stands for true positive, TN represents true negative, FP acts for false positive and FN means false negative.
As shown in Table. Ⅰ, five general parameters are used to measure our model. The BEDROC values predicted by several widely used methods such as NB, SVM, LR, DNN_PCM are respectively 0.79±0.08, 0.88±0.05, 0.88±0.06, and 0.93±0.03. The BEDROC value predicted by the three-layer neural network model under the Bayesian regularization algorithm we proposed can reach 0.95±0.03, which is 2-3% higher than the highest value of the previous models. The MCC values predicted by NB, SVM, LR, DNN_PCM are 0.41±0.03, 0.56±0.07, 0.51±0.06, and 0.53±0.07, sequentially. The MCC value of our model is 0.93±0.03, which is 37-40% higher than the highest MCC value of the previous models. The AUC values predicted by NB, SVM, LR are correspondingly 0.7875, 0.8126, 0.8402, whereas our AUC value is 0.9752, which is the highest among them. The data comparisons indicate that the prediction performance of our proposed model is extraordinary.
After processing 1974 samples, the remaining 543 samples with high correlation are selected. The data of the first 20 samples with better biological activity to inhibit ERα are then processed, as shown in Figure 10. Taking the molecular Lipoaffinity Index as an example, when the molecular descriptor value stays from 10 to 12, the ADMET properties expresses with the highest frequency, about 28 times. For molecular n6Ring, when the molecular descriptor value is between 2.5 to 4.5, the frequency of ADMET properties reaches the highest, about 32 times. 20 samples are listed in Figure 10, who are compounds screened by our proposed model that not only have high anti-ERα activity,  Method and Algorithm NB SVM LR DNN_PCM Our Work BEDROC [12] 0.79±0.08 0.88±0.05 0.88 ± 0.06 0.93±0.03 0.95±0.03 MCC [13] 0.41±0.03 0.56±0.07 0.51 ± 0.06 0.53±0.07 0.93±0.03 AUC [14] 0 but also show excellent ADMET properties for multiple times in a specific interval. The result shows that the model we designed has successfully achieved the purpose of screening effective anticancer drugs with little side effects.

IV. Conclusion
For future drug screening, the contribution of computers can greatly reduce meaningless steps and provide strong support for drug discovery. In dealing with the massive number of parameters that need to be considered in model training, this paper presents a method that takes into account both the burden of computing power and prediction accuracy. An efficient quantitative prediction model (with AUC of 0.9752, MCC of 0.93) of compounds bioactivity has been constructed. During the training, Pearson correlation analysis, Levenberg Marquardt algorithm and Bayesian regularization algorithm were used to effectively train and analyze the data. Moreover, the ADMET properties were also considered, using cross-entropy to calculate the loss of neural network model and supervise learning. Our training data sets are more extensive and the prediction accuracy of our proposed model is higher, with the values of AUC and MCC much higher than other methods. This research can improve the efficiency of screening the compounds who can inhibit the ERα bioactive, providing new ideas for the treatment of breast cancer.