Benchmarking Small-Dataset Structure-Activity-Relationship Models for Prediction of Wnt Signaling Inhibition

Quantitative structure-activity relationship (QSAR) models based on machine learning algorithms are powerful tools to expedite drug discovery processes and therapeutics development. Given the cost in acquiring large-sized training datasets, it is useful to examine if QSAR analysis can reasonably predict drug activity with only a small-sized dataset (size < 100) and benchmark these small-dataset QSAR models in application-specific studies. To this end, here we present a systematic benchmarking study on small-dataset QSAR models built for prediction of effective Wnt signaling inhibitors, which are essential to therapeutics development in prevalent human diseases (e.g., cancer). Specifically, we examined a total of 72 two-dimensional (2D) QSAR models based on 4 best-performing algorithms, 6 commonly used molecular fingerprints, and 3 typical fingerprint lengths. We trained these models using a training dataset (56 compounds), benchmarked their performance on 4 figures-of-merit (FOMs), and examined their prediction accuracy using an external validation dataset (14 compounds). Our data show that the model performance is maximized when: 1) molecular fingerprints are selected to provide sufficient, unique, and not overly detailed representations of the chemical structures of drug compounds; 2) algorithms are selected to reduce the number of false predictions due to class imbalance in the dataset; and 3) models are selected to reach balanced performance on all 4 FOMs. These results may provide general guidelines in developing high-performance small-dataset QSAR models for drug activity prediction.


I. INTRODUCTION
Drug development often involves extensive investment and time effort on experimental screening of drug candidates. To reduce the resource demand in such drug screening processes, predictive models based on advanced computational methods have been developed to help screen possible drug compounds with high cost-effectiveness [1]- [5]. To date, computational methods based on three-dimensional quantitative structure-activity relationship (3D QSAR) analysis, high-throughput imaging (HTI), and pharmacophore modeling [5], [6]- [10] have succeeded in predicting the effectiveness of drug compounds towards prevalent human diseases (e.g., cancer [10]). Nonetheless, these high-performance The associate editor coordinating the review of this manuscript and approving it for publication was Henry Hess. methods often require user intervention steps on molecular/ ligand alignment [5], [8], [9] or high-resolution images that are not available for all drug compounds [7]. To this end, two-dimensional (2D) QSAR analysis has emerged as a viable alternative method to build predictive models from the widely available chemical structures of drug candidates, which can perform well with no user intervention steps. This analysis correlates the structural details of drug molecules to their effectiveness in biological assays that correspond to specific diseases and builds models that can predict the bioactivity or physiochemical properties of unknown drug compounds [1]- [3], [6].
In 2D QSAR studies, the features of each drug molecule are often coded by a 2D molecular fingerprint, resulting in a numerical vector to describe the presence or absence of substructures in the molecule such as chemical bonds, functional groups, and connectivity pathways [3], [11]. The vectors from drug molecules with known effectiveness to one targeted biological assay (active vs. inactive) will be used to build predictive QSAR models based on machine learning algorithms such as support vector machines (SVM), decision trees, k-nearest neighbors (KNN), and artificial neural network (ANN) [6], [12]. The resulting QSAR models have succeeded in predicting effective drugs of psychological disorders [13], protein-ligand binding affinities [14], and mTOR kinase inhibitors [15].
Nonetheless, current 2D QSAR analysis often relies on training machine learning algorithms with a large-sized drug activity dataset (size > 1000) [6], [16], which requires significant time and effort on both benchwork and statistical analysis. For this reason, developing new drugs can cost hundreds of millions of U.S. dollars [17] and can take over a decade to transition to a marketable state [18]. Given the cost in acquiring these large-sized datasets, it will be useful to examine if 2D QSAR analysis can result in reasonable prediction of drug activity with only a small-sized dataset (size < 100), and moreover benchmark these small-dataset QSAR models in application-specific studies. Such small-dataset QSAR analysis will be especially beneficial at early stages of drug development, when the activity data from potential drug candidates remain limited [19], [20].
Wnt signaling pathways are essential in cell biology and the development of therapeutics for highly prevalent diseases such as cancer, Schizophrenia, and kidney damage [21]- [25]. Some of these diseases (e.g., lung cancer) are associated with altered function/levels of proteins in specific Wnt/β-catenin pathways (one type of Wnt signaling pathway), which lead to elevated gene expression that influences cell proliferation and survival [21]. For this reason, inhibition of Wnt/β-catenin signaling by small molecule modulators (e.g., Niclosamide) is being considered and developed as a candidate cancer treatment [21], [26]- [29]. For instance, screening assays based on live cell imaging have been used to identify Wnt/β-catenin inhibitors [30]. These inhibitors induce the internalization of Frizzled receptor proteins (i.e. moving from cell membrane to cell cytoplasm) in human U2OS cells; such internalized receptors cannot be activated by extracellular Wnt proteins (secreted from other cells), effectively inhibiting the strength of Wnt signaling [21].
Given the clinical significance of Wnt signaling in a variety of diseases and the progress made from screening assays, here we examine if small-dataset QSAR models could facilitate and expedite the process of identifying small molecule inhibitors. If successful, such predictive models and experimental QSAR studies can serve as complementary techniques in screening drug candidates for Wnt/β-catenin signaling inhibition and ultimately add to therapeutics development. To quantify the performance in our analysis, we benchmark 72 QSAR models based on: 1) 4 machine learning algorithms including quadratic support vector machine (QSVM), fine tree, random undersampling (RUS) boosted tree, and bagged tree; 2) 6 molecular fingerprints including fingerprint 2, 3, 4 (FP2, FP3, FP4), molecular access system fingerprint (MACCS), and extended-connectivity fingerprint 4 and 6 (ECFP4 and ECFP6) with three fingerprint lengths for each; and 3) a training dataset of 56 compounds and an external validation dataset of 14 compounds, both of which were experimentally tested in U2OS cells. We evaluate these models using 5-and 10-fold cross-validation and compare 4 figuresof-merit (FOMs) in QSAR analysis including accuracy, area under curve (AUC), sensitivity, and specificity.
Our data show that the model performance is maximized when: 1) molecular fingerprints are selected to provide sufficient, unique, and not overly detailed representations of the chemical structures of drug compounds; 2) algorithms are selected to reduce the number of false predictions due to class imbalance in the dataset; and 3) models are selected to reach balanced performance on all 4 FOMs. These results may provide general guidelines in developing high-performance small-dataset 2D QSAR models for drug activity prediction.

A. DATASETS
To the best of our knowledge, there has been a total of 70 drug compounds available in literature with experimentally validated effectiveness for internalizing Frizzled receptor proteins, and thus inhibiting Wnt signaling in human U2OS cells [21], [27]- [29]. Specifically, all these 70 compounds have been classified as active [inactive] compounds if they were able [unable] to induce the internalization of Frizzled receptor proteins, according to the cell imaging data from before and after applying the compound to the cell culture. As a result, 29 compounds were experimentally tested to be active and 41 were tested to be inactive, suggesting a mild class imbalance between active and inactive compounds. It is noted that 65 of these compounds (except 5 inactive compounds) are derivatives of niclosamide [26], suggesting high structural similarities among these 70 compounds.
In this work, we chose to build 2D QSAR models from the aforementioned 70 compounds for prediction of Wnt signaling inhibition. We chose this dataset because these 70 compounds are all experimentally validated with the same biological assay (i.e., the internalization of Frizzled receptor proteins in U2OS cells) and form a mild class imbalance. It is noted that assays targeted at the dynamics of other Wntsignaling-inhibition related proteins are also available in the ChEMBL database [31]- [36]. However, these assays have yet to experimentally test a sufficient number of active or inactive compounds, therefore making the QSAR modeling challenging (e.g., 3 active compounds in [36]). On the other hand, we found that the size of our dataset, 70, is on par with other small-dataset QSAR studies (e.g., 16 in [19], and 48 in [20]); we thus believe this dataset has a sufficient number of data to build good-performing QSAR models.
We then represented the chemical structures of these 70 compounds listed in the ChEMBL database in a simplified molecular-input line-entry system (SMILES) notation.
Each compound was labeled as 0 (inactive) or 1 (active) by its effectiveness on Wnt signaling inhibition, which was tested in the assay of internalizing Frizzled receptor proteins (Fig. 1). We next randomly selected 80 % of these 70 compounds to form a training dataset (56 compounds; 31 inactive, 25 active) to develop 2D QSAR models and perform cross-validation to statistically analyze their performance; we used the remaining 20 % of these 70 compounds as an external validation dataset (14 compounds; 10 inactive, 4 active) to examine if these QSAR models can predict the activity of compounds that were not used in model training [37]. We noted that the class imbalance of the randomly selected training dataset (25/56 = 44 % active samples) is on par with the class imbalance in the overall 70-compound dataset (31/70 = 41 % active samples) [38].

B. FINGERPRINT REPRESENTATION
To train 2D QSAR models, we used OpenBabel graphical user interface (GUI) to convert the SMILES notation of the compounds in the training dataset to 2D molecular fingerprint representations ( Fig. 1) [39]- [41]. Each of these fingerprint representations is a binary bit vector with a defined length; each bit or group of bits represents the presence or absence of structural features in the compound. For instance, the niclosamide compound is represented by MACCS fingerprint with a length of 128 in the following steps: 1) finding the SMILES notation of niclosamide in the ChEMBL database, O=C(Nc1ccc([N+] (=O) [O-])cc1Cl)c1 cc(Cl)ccc1O; 2) converting this SMILES notation in the OpenBabel GUI to a hexadecimal vector 4a5124612940006 04091001f7aebecf6; and 3) converting the hexadecimal vector to a binary one, 01001010010100010 01001000110 0001001010010100000000000000011000000 10000001001 0001000000000001111101111010111010111 110110011110110. The resulting binary vector and the effectiveness of niclosamide for Wnt signaling inhibition (active) will then be used to train QSAR models using MATLAB Classification Learner application (see Section 2C).

C. ALGORITHMS
Using the fingerprint representations of 56 compounds in the training dataset with known activity for Wnt signaling inhibition, we developed predictive QSAR models based on four machine learning algorithms: QSVM, fine tree, bagged tree, and RUSboosted tree. We selected these algorithms in our benchmarking study since their resulting QSAR models showed the highest accuracy and AUC values among 25 available algorithms in MATLAB Classification Learner application. Specifically, 1) QSVM is a binary classifier to define an optimal hyperplane that maximally separates two classes of high-dimensional data [46]; 2) fine tree algorithm (abbreviated as Fine) uses up to 100 decision rules (i.e. decision tree) for precise classification of the data [47]; 3) bagged tree algorithm (abbreviated as Bagged) first forms several subsets of data that are randomly sampled from the entire training dataset with replacement [48]. Each subset of data will be used to train a decision-tree based sub-model. This algorithm finally makes a robust classification of an unknown data by either voting or averaging the prediction results of this data from all sub-models [49]; 4) RUSboosted tree algorithm (abbreviated as RUSboosted) iteratively trains a series of decision-tree based sub-models, each of which is based on a subset of data formed by randomly under-sampling the majority class of the training dataset to alleviate the class imbalance [50], [51]. During the iteration, each data used for internal validation will increase its weight if it was incorrectly classified during the previous iteration, so that it is likely to be correctly classified in the current iteration. For this reason, the decision tree upon the completion of the iteration is a weighted vote from all involved sub-models and will be used to classify unknown data.

D. MODEL ASSESSMENT
To benchmark our models, we first studied the dependence of their FOMs on the cross-validation folding number k and VOLUME 8, 2020 the fingerprint length, respectively. We then benchmarked the FOM values of these models using the preferred k value and fingerprint lengths, followed by evaluating their capability to predict the activity of the 14 compounds in the external validation dataset.
To evaluate the statistical significance in our results, 1) all these models were trained for 3 independent times to obtain the mean values and the standard deviation of all 4 FOMs; 2) selected models (see details below) were then applied to the external validation dataset to obtain the mean values and the standard deviation of correct predictions. All QSAR models were trained and validated using the MATLAB Classification Learner application, detailed as follows: During the training of QSAR models, we applied the k-fold cross-validation procedure [52], which splits the training dataset into k sub-groups, iteratively selects one sub-group to validate the model trained by the remaining (k-1) sub-groups, and evaluates the model performance by the collective results. Specifically, we compared the 4 FOM values in 72 QSAR models with both 5and 10-fold crossvalidation, which are commonly used in training machine learning models [52], [53]. We then chose one preferred k value for the rest of our analysis based on the overall performance of these 72 models.

2) FINGERPRINT LENGTH
With the chosen k value, we next evaluated the FOM values in 24 models (based on 6 fingerprints by 4 algorithms) with 3 different fingerprint lengths, aiming to balance simplicity, resolution, and uniqueness of the fingerprint representations [40], [54]. For FP2, FP3, FP4, and MACCS, we trained our models using: 1) half the default length, 2) the default length, and 3) double the default length, and chose one preferred length for each fingerprint that yielded higher FOM values than the other two lengths (see details below). For ECFP4, we chose lengths of 2048, 4096, and 8192, whereas for ECFP6, we chose lengths of 1024, 2048, and 4096 in our analysis, because ECFP6 has no default length reported and its performance was suggested to likely improve when the length increases [55].  FN); and specificity = TN / (TN + FP); AUC was defined as the integrated area underneath the receiver operating characteristic curve (i.e., sensitivity versus 1-specificity).
To evaluate if these models can well predict the activity of unknown compounds, we benchmarked their percentage of correct predictions (PCP) out of the 14 compounds in the external validation dataset that were not used in model training [56].

A. FOLDING NUMBER K
We first studied the effect of k values (5 and 10) on FOMs in 72 QSAR models based on 4 algorithms by 6 fingerprints by 3 fingerprint lengths (see representative cases in Fig. 2 and Table 1). If one model shows less than 5 % difference in all 4 FOMs between two k values, or if one model shows that k = 5 and k = 10 yields more than 5 % improvement in different FOMs, we will view this model as one that has no preferred k value. If one model shows more than 5 % improvement in 1-4 FOMs at one k value (either 5 or 10), we will select this k value as the preferred k value for that model. According to these definitions, our data show that: 1) half of the models (33/72, in Fig. 2a) have no preferred k value; and 2) about one quarter of the models (20/72 in Fig. 2b, 19/72 in Fig. 2c) have a preferred k value (either 5 or 10). This result shows that overall k = 5 and k = 10 yield comparable performance among these 72 models. We therefore chose k = 5 for the following analysis.

B. FINGERPRINTS 1) FINGERPRINT LENGTH
Using k = 5, we next evaluated the effect of fingerprint lengths (3 lengths per fingerprint) on FOMs in 24 models based on 4 algorithms by 6 fingerprints (see representative cases in Fig. 3 and Table 2). Our data show that 16/24 models have at least one FOM where one length yields more than 5 % improvement over the other two lengths. If one model shows that different lengths yield more than 5 % improvement in different FOMs, or if one model shows less than 5 % difference in all 4 FOMs among all 3 lengths, we will view this model as one that has no preferred length. If one model shows more than 5 % improvement in 1 to 3 FOMs at one length, we will select this length as the preferred length for that model (note: no model has one preferred length that yields more than 5 % improvement in 4 FOMs). According to these definitions, our data show that 50 % of the models (12/24, Fig. 3a) had no preferred length and 50 % of the models (12/24, Fig. 3b) had a preferred length.  In FP2, FP3, FP4, and MACCS models (a total of 16 by 4 algorithms), we found that: 1) increasing the length from default values does not capture additional structural details of the compounds in their fingerprint representations (i.e., merely adding extra zeros to representation vectors). As a result, half of the models (9/16) do not have more than 5 % improvement in any FOM, whereas 2 of the 7 remaining models do not have their longest length as the preferred length; 2) decreasing the length from default values will make fingerprints lose their resolution and likely fail to capture structural details that are needed to differentiate highly similar compound structures (see Section II.A) [35], [57], [58]. As a result, one quarter of the models (4/16) have more than 5 % degradation in 1 or 2 FOMs, whereas 50 % of the models (8/16) do not have their shortest length as the preferred length.
In ECFP6 models (a total of 4 by 4 algorithms), we found that at the length of 2048 and/or 4096: 1) 1 model has more than 5 % improvement in 2 FOMs than those at the length of 1024; 2) 2 models have more than 5 % degradation in 1 or 2 FOMs than those at length 1024; and 3) one model shows less than 5 % difference in all 4 FOMs compared to those at the length of 1024. This result shows that the ECFP6 fingerprint does not always capture more structural details in our dataset at lengths longer than 1024 [40], [58].
Based on these analyses, we chose the default lengths in FP2 (1024), FP3 (64), FP4 (512), and MACCS (256) models for the rest of our analysis because: 1) only half (9/16) of the models have a preferred length, 2) a longer length often adds no new structural information, and 3) a shorter length often results in a loss of structural details. For ECFP6, we chose to analyze all 3 lengths in the following (1024, 2048, and 4096 labeled as ECFP6A, ECFP6B, and ECFP6C, respectively) because there is no default length reported for this fingerprint [46]. For ECFP4, we again chose to analyze VOLUME 8, 2020 all 3 lengths (2048, 4096, and 8192 labeled as ECFP4A, ECFP4B, and ECFP4C, respectively).

2) FINGERPRINT UNIQUENESS
Due to the structure similarity of the compounds in our dataset, we also examined if these fingerprints at their chosen lengths can uniquely represent the compound structures. If not, there would be identical representation vectors representing both active and inactive compounds, which can result in misclassifications by the corresponding model [3], [59]. From this respective, our data show that FP2, ECFP4, and ECFP6 fingerprints each yield only 2 identical vectors across 56 compounds in the training dataset, suggesting that they can represent most compound structures in a unique vector [39], [45], [58]. In contrast, FP3, FP4, and MACCS fingerprints each yields over 20 identical vectors among the training dataset, suggesting that they are less unique in representing compound structures [58].

C. MODEL FOMS
Using k = 5 and the fingerprint lengths we chose, we next benchmarked the 4 FOM values in 40 models based on 4 algorithms by 10 fingerprints (ECFP4 and ECFP6 each with 3 lengths) (see Fig. 4 and Table 3), with the results described as follows:
Based on accuracy and AUC values, we found that FP2/ QSVM, MACCS/RUSboosted tree, ECFP6B/RUSboosted tree, and ECFP6C/RUSboosted tree models performed the best, whereas FP3/Bagged tree and FP4/Bagged tree models performed the worst. The overall fair performance of the remaining 34 models (50-70 % accuracy and AUC) can result from the small size of the training dataset and the challenge in classifying compounds with similar structures [60].

2) SENSITIVITY AND SPECIFICITY
Our sensitivity and specificity data (Figs. 4c and 4d) show that: 1) except for the FP3/Bagged tree, FP4/Bagged tree, ECFP4A/Bagged tree, ECFP4B/Bagged tree, and ECFP4C/ Bagged tree models, the remaining 35 models have more than 50 % sensitivity; the majority of these models (21/35) show less than 10 % standard deviation; 2) 15 models show > 10 % standard deviation; 3) except for the FP4/QSVM model, the remaining 39 models have more than 40 % specificity; the majority of these models (39/40) show less than 10 % standard deviation; 4) 36 models have less than 40 % difference between their sensitivity and specificity values; of the four exceptions, FP3/Bagged tree, FP4/Bagged, and ECFP4B/Bagged tree models showed low sensitivity (< 5 %) due to a large number of FNs, and high specificity (> 90 %) due to a small number of FPs. The imbalance between sensitivity and specificity in FP3/Bagged tree, FP4/Bagged tree, and ECFP4B/Bagged tree models is likely due to a significant bias they develop to the majority class (inactive compounds) in our training dataset. This bias can result from the class imbalance in our training dataset (31 inactive versus 25 active) [60], [61], which can make these models form classification rules primarily on inactive compounds. This in turn would lead to 1) misclassifications of active compounds, thus increasing the number of FNs [60] and 2) overall a small number of true predictions, thus decreasing the number of FPs. Furthermore, such imbalance can be worsened by the way the bagged tree algorithm from sub-models based on randomly sampled subsets of the entire training dataset. Such sampling process may drop active compounds and result in subsets where inactive compounds are even more dominated (i.e., yielding a greater imbalance between inactive and active compounds) [48], [60], [62], [63].
Overall, our sensitivity and specificity data highlight the importance of benchmarking all 4 FOMs when evaluating the model performance. Accuracy and AUC alone may not fully capture the downside of the model performance, such as the imbalance between sensitivity and specificity trained from imbalanced training datasets.

D. MODEL VALIDATION
To evaluate if the aforementioned 40 models can predict the activity of unknown compounds, we examined their PCP on 14 compounds (10 inactive versus 4 active) in the external validation dataset (see Fig. 5 and Table 4  These results suggest the promise of our small-dataset models in predicting Wnt inhibitors. For each of these models, we compared its PCP from the validation process (Fig. 5) with its accuracy value from the training process (Fig. 4a) to check if it is an overfitted model [56]. Our data show that: 1) PCP is more than 15 % lower than the accuracy in 1 FP2 model and 2 ECFP6A models; and 2) PCP is less than 15 % lower than the accuracy in all ECFP4C, ECFP6B, and ECFP6C models. For models listed in the first category, PCP is significantly lower than the accuracy, suggesting that these models likely overfitted compound structures (e.g., captured unnecessary structural details) in the training dataset [64].
Based on both PCP and 4 FOMs of these 40 models, we observe that ECFP4 and ECFP6 fingerprint at the longer lengths offers unique and sufficient representations of structural details with no overfitting. In contrast, FP3, FP4, and MACCS fingerprints also show no overfitting but fail to offer unique representations. FP2 fingerprint features high accuracy and AUC but also shows overfitting. These results suggest that fingerprints should be chosen to sufficiently, uniquely, but not overly represent structurally similar compounds in developing high performance small-dataset QSAR models.

E. PERFORMANCE COMPARISON
We finally remarked that the FOMs in our QSAR models are on par with other computational methods used for drug discovery. For instance, Mayr et al. have comprehensively studied ca. 500,000 drug compounds across more than 1000 assays. They built predictive models of the drug activity (in the respective assay) by machine learning algorithms [65]. By averaging the AUC values of each model, they reported typical AUC values around 70 %. As another example, Hofmarcher et al. have built predictive models from over 30000 compounds across 209 assays by neural network algorithms [66]. By averaging the FOMs over all assays, they reported typical accuracy values around 77 %, AUC values around 70 %, sensitivity values around 50 %, and specificity values around 76 %. In comparison, our models typically obtained accuracy values around 65 %, AUC values around 70 %, sensitivity values around 70 %, and specificity values around 60 %. Nonetheless, we noted that computational methods on prediction of Wnt signaling inhibitors are still at their early stage of development at this moment. We expect that future efforts on this essential field of cell biology will allow more direct comparison with our QSAR models.

IV. CONCLUSION
In this study, we present a systematic small-dataset QSAR study for prediction of effective Wnt signaling inhibitors that are essential to therapeutics development in prevalent human diseases. Specifically, we trained 72 QSAR models based on 4 algorithms, 6 fingerprints, and 3 fingerprint lengths using a training dataset (56 compounds), evaluated their performance on 4 FOMs, and examined their PCP using an external validation dataset (14 compounds). Our data show that the model performance is maximized when: 1) molecular fingerprints are selected to provide sufficient, unique, and not overly detailed representations of the compound structures (i.e. to avoid fingerprint lengths that lose fine structural features, identical representation vectors for multiple compounds, and overfitting); 2) algorithms are selected to reduce the number of false predictions due to class imbalance in the dataset; and 3) models are selected to reach balanced performance on all 4 FOMs. These results may provide general guidelines in developing high-performance small-dataset 2D QSAR models for drug activity prediction. Moving forward, it will be useful to test if these guidelines would apply to QSAR studies based on other Wnt signaling related assays. To achieve this, we will need to expand the experimental data in those assays, which are often associated with other targeted proteins (e.g., Wnt-3a, kinases) or host cells (e.g., MCF7, ST14A).

ACKNOWLEDGMENT
(Mahtab Kokabi and Matthew Donnelly contributed equally to this work.)