An Ensemble-Learning Based Application to Predict the Earlier Stages of Alzheimer’s Disease (AD)

Ensemble methods are known to enhance prediction performance. Therefore, we focused on developing a weighted ensemble method using a novel combination of Cerebrospinal Fluid (CSF) protein biomarkers to predict AD’s earlier stages with greater accuracy than the state-of-the-art CSF protein biomarkers. To this end, two feature selection methods, namely Recursive Feature Elimination (RFE) and L1 regularization, were used to screen the most important subset of features for building a classification model using the Mild Cognitive Impairment (MCI) dataset. A novel combination of three biomarkers, namely Cystatin C, Matrix metalloproteinase 10 (MMP10), and tau protein, was screened using the RFE method based on the linear Support Vector Machine (SVM) and Logistic Regression (LR) classifiers. A two-tailed unpaired t-test at a 5% significance level showed a significant difference in the mean levels of Cystatin C, MMP10, and tau protein between the cognitively normal and cognitively impaired groups. An ensemble model using a weighted average of the two best-performing classifiers (LR and linear SVM) was created using this novel subset of the three most informative features. The weighted average ensemble performed significantly better than the LR and linear SVM base classifiers. The area under the Receiver Operating Characteristic curve (ROC_AUC) and the area under the Precision-Recall curve (AUPR) of our proposed model were 0.9799 ± 0.055 and 0.9108 ± 0.015, respectively. The performance of our proposed weighted average ensemble model, built using a novel combination of CSF protein biomarkers, was significantly better (p < 0.001) than that of models generated using different combinations of CSF protein biomarkers obtained from recent studies. An ensemble-learning based application was implemented and deployed on Heroku at https://appsalzheimer.herokuapp.com.


I. INTRODUCTION
Alzheimer's disease is a neurodegenerative disorder that causes irreversible and progressive brain cell damage, usually affecting people during their mid-60s [1], [2]. Preclinical changes in the brain associated with Alzheimer's begin years before the onset of the disease's typical clinical symptoms. Though the onset of AD cannot be reversed or stopped, early detection of the disease can allow treatment and spontaneous care of Alzheimer's patients in their earlier stages before irreparable damage to the brain has occurred [3], [4]. Therefore, studies on the development of new strategies for the earlier diagnosis of AD are among the most active research areas in Alzheimer's science. Past research has shown that individuals with MCI or presymptomatic Alzheimer's have a greater risk of eventually converting to AD [5]-[10]. The biochemical changes in CSF associated with AD's progression provide a sound potential source of diagnostic biomarkers to study the disease's preclinical and clinical stages. Thus, for the early detection of AD, identifying specific biomarkers that play a significant role in the conversion of MCI to AD has become of great interest in recent times [11]-[14]. At present, the conventional CSF biomarkers, namely tau, amyloid-β42 (Aβ42), and phosphorylated forms of tau (p-tau), have shown the most significant potential in screening MCI patients who eventually progress to clinically diagnosable AD [15]-[19].
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
AD is multifactorial by nature; therefore, a substantial overlap of the clinical biomarkers between cognitively healthy and cognitively impaired individuals is observed [20]. Thus, there is an urgent need to identify novel protein biomarkers from CSF. Identifying new CSF protein biomarkers will enhance our current understanding of the pathophysiology of MCI and the progression of MCI patients to AD. However, the analysis and interpretation of the multiple factors (clinical biomarkers) involved in the conversion of MCI to AD are difficult and complicated. Therefore, in recent years the applications of pattern recognition and Machine Learning (ML) algorithms in Computer-Aided Diagnostic (CAD) tools have become invaluable. These tools are highly efficient in analyzing and interpreting AD's multifactorial nature, enabling clinicians and scientists to distinguish between healthy controls and patients with MCI who convert to AD. Several machine learning approaches have been applied to different combinations of features, namely CSF biomarkers, genotypes, brain imaging, clinical information, and demographics, to predict MCI subjects' conversion to AD with varying levels of precision and accuracy [11], [6], [21]-[27].
Nevertheless, only a few studies have looked for novel CSF protein biomarkers that can augment the accuracy of the current, leading CSF protein biomarkers (Aβ42, Tau, p-tau181) in distinguishing MCI patients from healthy subjects and, at the same time, detecting the earlier stages of AD [11], [13]. Lately, ensembling has gained importance in distinguishing a progressive form of MCI, which ultimately leads to AD, from cognitively normal individuals [23], [28], [29]. Therefore, we propose an ensemble learning model to identify additional CSF protein biomarkers, fitted on attributes (CSF protein biomarkers) selected using the RFE method.
The Alzheimer dataset generated by Craig-Schapiro et al. 2011 [11], comprising both demographic features and CSF protein biomarkers, was employed to build our model for predicting the earlier stages of AD. Our proposed ensemble model can classify MCI patients from cognitively normal subjects with better sensitivity, specificity, and accuracy than classification models based on the state-of-the-art CSF protein biomarkers (tau, Aβ42, p-tau) [11]. Our CSF protein biomarker-based web application is the first of its kind to predict the earlier stages of AD. The application is written in Python, developed in Flask, a micro web framework, and deployed on Heroku at ''https://appsalzheimer.herokuapp.com.'' The remainder of the paper is divided into the following sections: 1) The Material and Methods section provides information about the source and contents of Alzheimer's clinical dataset. It also provides a detailed description of the data preprocessing methods, supervised learning algorithms, and model evaluation metrics employed for assessing the performance of the various models built using Alzheimer's clinical data. 2) The Result section briefly explains the results of the various methods employed to generate an efficient predictive model for predicting the earlier stages of AD. 3) The Discussion section describes our model's efficiency in differentiating cognitively impaired subjects from cognitively normal subjects. 4) Finally, the Conclusion section provides the concluding remarks and the scope of our model and its implementation in predicting the earlier stages of AD. A pictorial representation of the methodology adopted in our study to develop an ensemble-learning-based web application for predicting the earlier stages of AD is depicted in Fig. 1.

A. DATA SOURCE AND DESCRIPTION
The MCI clinical dataset was downloaded from Figshare [30]. The cognitive status of the 333 subjects involved in the clinical study was assessed according to the Clinical Dementia Rating Scale (CDR). The clinical dataset consisted of 91 mildly cognitively impaired subjects (CDR 0.5 and CDR 1) and 242 cognitively normal subjects (CDR 0). Its features included 124 credible CSF protein biomarkers; the demographic features age and gender; a set of non-imaging protein biomarkers, namely amyloid proteins, native tau, the phosphorylated form of tau (pTau), Aβ42, and β42; and the allelic variants of the Apolipoprotein E genotype (E2, E3, and E4). The Apolipoprotein E variant E4 is the most significant of the Apolipoprotein E genotype variants because of its association with AD [31], [32].

B. PREPROCESSING OF THE MCI CLINICAL DATASET
In a Machine Learning process, preprocessing is the step in which the data are cleaned, transformed or encoded, and reduced to a state from which the learning algorithms can readily learn, so that a better predictive model can be built.
The current section discusses the data processing steps involved in the present study, as illustrated in Fig. 1. One-hot encoding was used to transform the categorical features into a binary representation. Z-score normalization was used for the standard scaling of the numerical features present in the MCI dataset. The correlation between the numerical-numerical and numerical-categorical features was checked using the Pearson Correlation Coefficient (PCC) to screen out correlated features from the MCI dataset. Two hundred copies of the MCI dataset were generated using stratified splitting with replacement to build a statistically meaningful model. Each of the two hundred randomly generated copies was then segmented into 80% training-cum-validation data and 20% testing data. Feature selection was performed using two kinds of methods, namely the wrapper method and the embedded method. These feature selection methods select the best subset of features based on feature importance/information, thereby keeping the best set of predictors and ignoring the redundant and less important features. In the present study, we have focused on RFE, an important example of a wrapper-based feature selection process, and on L1 regularization (regression with an L1 penalty), an example of an embedded feature selection method, to screen the most important subset of features. The preprocessing techniques employed in the present study are described in detail in the following subsections.

1) ENCODING OF CATEGORICAL FEATURES
The non-continuous categorical features of the MCI dataset were one-hot encoded, which essentially transforms each categorical feature with ''n'' categories into ''n-1'' binary categories, a format suitable for downstream estimators.
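As an illustration of the n to n-1 encoding described above, here is a minimal pure-Python sketch. The function name `dummy_encode` and the sample genotype values are hypothetical, and dropping the alphabetically first category as the reference level is an assumption; the paper does not say which category was dropped.

```python
def dummy_encode(values):
    """Encode a categorical column with n categories as n-1 binary columns."""
    categories = sorted(set(values))
    kept = categories[1:]  # assumption: the first category is the reference level
    rows = [[1 if value == category else 0 for category in kept] for value in values]
    return kept, rows


# Hypothetical genotype column, for illustration only.
genotype = ["E3", "E4", "E2", "E3"]
columns, encoded = dummy_encode(genotype)
print(columns)   # names of the n-1 binary columns
print(encoded)   # one row of 0/1 indicators per sample
```

In pandas, `pandas.get_dummies(column, drop_first=True)` produces the same kind of n-1 column layout.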

2) DATA NORMALIZATION
The Z-score normalization was used to normalize the numerical attribute data points. Normalization scales the data to a common range, which helps avoid training incorrect machine learning-based predictive models. In Z-score normalization, the dataset's scaling was performed using the mean µ and standard deviation σ computed on the MCI dataset.
Refer to ''equation 1'' for the z-score normalization of a numerical variable x in the MCI dataset:
$$z = \frac{x - \mu}{\sigma} \tag{1}$$
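The z-score scaling can be sketched with NumPy as follows. The helper name `zscore` and the sample values are illustrative, and using the population standard deviation (`ddof=0`) is an assumption, since the paper does not state which estimator was used.

```python
import numpy as np


def zscore(column):
    """Scale a numerical column to zero mean and unit standard deviation."""
    column = np.asarray(column, dtype=float)
    mu = column.mean()
    sigma = column.std()  # assumption: population standard deviation (ddof=0)
    return (column - mu) / sigma


# Illustrative values, not taken from the MCI dataset.
scaled = zscore([2.0, 4.0, 6.0])
print(scaled)
```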

3) FEATURES CORRELATION
The Pearson correlation coefficient was used to check for features with a correlation >0.99. Pearson correlation measures the linear relationship between two predictors X and Y. A value close to one indicates the highest correlation between the two predictors, while a correlation coefficient of zero indicates no linear correlation between them. As no predictors were highly correlated (>0.99), none of the predictors were deleted.
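A correlation screen of this kind might look like the sketch below. The function `highly_correlated_pairs` and the synthetic columns are hypothetical, and comparing the absolute value of the coefficient against the threshold is an assumption, so that strong negative correlations are also flagged.

```python
import numpy as np


def highly_correlated_pairs(X, names, threshold=0.99):
    """Return pairs of feature names whose |Pearson r| exceeds the threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j]))
    return pairs


# Synthetic data: "c" is an exact linear function of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, 2.0 * a + 1.0])
pairs = highly_correlated_pairs(X, ["a", "b", "c"])
print(pairs)
```

An empty result corresponds to the situation reported here, in which no predictor had to be removed.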

4) STRATIFIED FEATURE SAMPLING
Because the number of instances in the negative class, i.e., the cognitively normal subjects (CDR 0), is significantly higher than in the positive class (CDR > 0), the dataset is imbalanced. Thus, to ensure adequate representation of the rarer class during the training-cum-validation process, a stratified sampling method with replacement was used.
Because the MCI dataset has a limited number of samples for building a statistically meaningful model, 200 random copies of the dataset were generated to average out the results and obtain a more realistic estimate. Within each copy, a random train-test split of 80-20% was performed.
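One way to approximate the repeated stratified 80-20 splitting is scikit-learn's `StratifiedShuffleSplit`. This is a sketch under assumptions: the paper does not name the routine it used, the feature matrix below is a placeholder, and each split here draws samples without replacement, with the 200 independent repetitions standing in for the "copies".

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Class labels with the sizes reported for the dataset: 242 CDR 0, 91 CDR > 0.
y = np.array([0] * 242 + [1] * 91)
X = np.zeros((len(y), 1))  # placeholder features, only the shape matters here

splitter = StratifiedShuffleSplit(n_splits=200, test_size=0.2, random_state=0)
splits = list(splitter.split(X, y))

train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))
print(y[train_idx].mean(), y[test_idx].mean())  # class ratio is roughly preserved
```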

5) FEATURE SELECTION
Feature reduction was applied to the training dataset in order to select an optimal subset of the attributes. Feature reduction was performed because the MCI clinical dataset's feature space is quite large, and not all attributes contribute effectively in predicting the dependent variable (class). Removing features having minimal importance reduces model variance while also improving training and inference time.
In this regard, two different types of Feature Selection Methods have been applied in this study, namely, the wrapper-based method and embedded method. Wrapper methods employ a greedy approach to iteratively look through the space of possible feature subsets, evaluating each subset of attributes based on a given machine learning algorithm performance. The wrapper-based feature selection method evaluates the possible interaction between the attributes to look for the best possible combinations of attributes that result in the best performing model, as shown in Fig. 2. Besides, testing all possible combinations of features can be computationally expensive. However, if the dataset is small, particularly when the number of attributes is small, each run would be computationally cheap.
In this study, we have used the RFE method, a popular example of the wrapper method of feature selection. The RFE method [33] was used to automatically prune the least essential attributes from the given set of features in the dataset. As the name suggests, RFE recursively eliminates the feature with the smallest coefficient on every iteration until the desired number of attributes is eventually reached. The RFE based feature selection method removes large correlations among features while also providing the best combination of the desired number of informative attributes for the effective prediction of the dependent variable.
Since there is a class imbalance in the dataset, and the minority class CDR > 0 is of importance to us, even at the cost of some misclassification for CDR 0 (majority class), RFE utilizing cost-sensitive estimators with weights, 0.2 for class ''0'' and 0.8 for class ''1'' was used. Practically with the Scikit-learn library, we can use any classifier (estimator) to do the feature subset search with RFE. Therefore, the RFE method, combined with LR, DT, RF, and Linear SVM individually, was used to search for the best informative features subset to build a more robust and effective classifying model.
Besides, by combining each ML algorithm with the RFE method, we can see whether the selected feature subsets are the same or different, and whether their performance in predicting the target variable differs or remains the same. A series of feature subsets of size (1 + n) was selected, where ''n'' varies from 0 to 4, i.e., subsets of one to five features. The number of attributes removed at each RFE iteration was set to three rather than one to reduce the runtime of the procedure. Thus, by recursively eliminating the three least important features in each loop, we reached our desired sets of informative features (i.e., subsets of one to five features).
The classifier fit the entire training data at every iteration, and coefficients for each feature were found. The attributes corresponding to the smallest coefficients were removed, and this process continued until only the desired set number of features remained. The performance metrics, namely accuracy, precision, recall, and ROC-AUC, were used to evaluate each machine learning algorithm's performance using several sets of most informative features. A pictorial representation of the classifier-RFE based methodology to screen the best subset of features for predicting the earlier stages of AD is depicted in Fig. 3.
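The classifier-RFE procedure described above can be sketched with scikit-learn's `RFE`. The synthetic data from `make_classification` is a stand-in for the MCI dataset, and the feature counts are illustrative; the class weights (0.2 and 0.8), the step of three features per iteration, and the target subset size follow the text.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in for the MCI dataset (assumption).
X, y = make_classification(
    n_samples=333, n_features=30, n_informative=5,
    weights=[0.73, 0.27], random_state=0,
)

# Cost-sensitive estimator: errors on class 1 cost four times as much.
estimator = LogisticRegression(class_weight={0: 0.2, 1: 0.8}, max_iter=1000)

# Remove the three features with the smallest coefficients per iteration
# until three features remain.
selector = RFE(estimator, n_features_to_select=3, step=3)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print(selected)
```

Swapping `LogisticRegression` for another estimator that exposes coefficients or feature importances gives the other classifier-RFE combinations mentioned in the text.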
On the other hand, the embedded methods merge the wrapper and filter methods' vital points by benefitting from the built-in feature selection process of certain machine learning algorithms.
Additionally, in the embedded methods, the feature selection and training processes are performed concurrently, as shown in Fig. 4. In the current study, we focus on the Least Absolute Shrinkage and Selection Operator (LASSO) or L1 regularization, an important example of an embedded technique to select the best subset of features significantly associated with the response variable. The LASSO approach shrinks the explanatory variables' coefficients with less or no discriminatory power to zero while selecting a subset of explanatory variables with non-zero coefficients [34], [35]. The selected explanatory variables represent the joint discriminatory power to separate MCI patients from cognitively normal subjects. Before feeding the MCI dataset to the LASSO algorithm, all the categorical variables, namely the explanatory and the response variable, of the MCI dataset, were one-hot encoded.
Moreover, the LASSO is a case of penalized least squares regression with an L1 penalty whose strength is controlled by lambda (λ). The hyperparameter λ (the penalty factor) was tuned during the cross-validation process. We applied the LASSO regression with 5-fold cross-validation (CV) 200 times to select the optimal subset of discriminatory attributes. A pictorial representation of the LASSO or L1 regularization-based methodology to screen the best subset of features is depicted in Fig. 5.
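A single run of LASSO with 5-fold cross-validated tuning of the penalty can be sketched with scikit-learn's `LassoCV`. This is an assumption about tooling, and the synthetic regression data below only illustrates how features with non-zero coefficients are read off; it is not the MCI dataset.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: only the first two of ten features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The penalty strength (called alpha in scikit-learn) is chosen by 5-fold CV.
model = LassoCV(cv=5, random_state=0).fit(X, y)

selected = np.flatnonzero(model.coef_)  # indices of non-zero coefficients
print(model.alpha_, selected)
```

Repeating such a run over many resampled copies and tallying how often each feature is selected would mirror the 200 repetitions mentioned above.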
The LASSO algorithm can be defined mathematically; refer to equation ''2.'' The equation can also be rewritten as a constrained problem; refer to equations ''3 and 4.'' The LASSO method's advantage is that it is easily interpretable, since it shrinks the coefficients of the noninformative and correlated attributes exactly to zero. Moreover, the LASSO method is usually preferred for feature selection when the dataset has a small number of observations and many features. Once the best subset of features was obtained, the pruned MCI dataset with that subset was used to build the four classifier-based models to classify MCI patients from normal subjects. The classifiers were evaluated on the various performance measures to screen the best performing model.
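For reference, the standard textbook forms of the LASSO are given below: first the penalized form, then the equivalent constrained form. The notation (N observations, p features, intercept β0, bound t) is an assumption and may differ from the symbols used in the original equations 2-4.

```latex
% Penalized (Lagrangian) form
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}

% Equivalent constrained form
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
  \quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

Each value of λ ≥ 0 corresponds to some bound t ≥ 0; a larger λ (smaller t) forces more coefficients to exactly zero.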

C. CLASSIFICATION ALGORITHMS
Four machine learning algorithms, Linear SVM, LR, DT, and RF, were used as base classifiers for the ensembling process.

1) LINEAR SUPPORT VECTOR MACHINE (LBSVM)
The linear SVM algorithm outputs the best possible separating hyperplane for a set of classes, given labeled training data. In a dataset, a sample is represented by a p-dimensional vector, and the linear SVM algorithm seeks a (p-1)-dimensional hyperplane that separates the data points into classes. Linear SVM is a fast, discriminative classifier based on the maximum-margin hyperplane [36]. The maximum-margin hyperplane is the one with the largest distance to the nearest data point of each class, and the SVM seeks this hyperplane to classify a sample. Mathematically, the maximum-margin hyperplane can be obtained as follows. Suppose we are given a dataset with data points $x_i$, where i = 1, 2, ..., n and ''n'' is the number of data points. The dataset is categorized into two classes: a positive class denoted by $y_i = 1$ and a negative class denoted by $y_i = -1$. We can find a hyperplane f(x) = 0 that separates the data points into the two classes. Refer to equation ''5.'' Here, w is a p-dimensional vector and b is a scalar constant; together they define the hyperplane. In the linear SVM algorithm, the data points associated with the two classes can be separated linearly by two parallel hyperplanes with no data points between them. These two hyperplanes are called separating hyperplanes, and the distance between them, bounded by the nearest data points of each class, is called the margin. The linear SVM maximizes the margin to improve the classifier's ability to categorize data points into specific classes. ''Equations 6-8'' describe the two separating hyperplanes. The data points that define the two hyperplanes are termed support vectors.
According to equation 7, the margin obtained in a multidimensional framework is given by $2/\lVert w \rVert$. Therefore, the ideal hyperplane separating the data points into two classes can be obtained by solving the following optimization problem. Refer to equations ''9 and 10,'' respectively.
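The standard hard-margin formulation that matches this description is summarized below. It is a textbook reconstruction under the definitions given in the text (n data points, p-dimensional w), not necessarily the exact form or numbering of the original equations 5-10.

```latex
% Decision function and separating hyperplane
f(x) = w^{\top} x + b, \qquad f(x) = 0

% The two bounding hyperplanes and the class constraints
w^{\top} x_i + b \ge +1 \quad \text{if } y_i = +1, \qquad
w^{\top} x_i + b \le -1 \quad \text{if } y_i = -1

% Both constraints combined
y_i \,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n

% Maximizing the margin 2 / ||w|| is equivalent to
\min_{w,\,b} \; \frac{1}{2} \lVert w \rVert^{2}
\quad \text{subject to} \quad y_i \,(w^{\top} x_i + b) \ge 1, \; i = 1, \dots, n
```

The support vectors are the points for which the constraint holds with equality.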

2) LOGISTIC REGRESSION (LR)
The LR is a supervised classification algorithm based on the linear regression model. Refer to equation ''11.'' The LR algorithm fits the training data to a logistic sigmoid function, as shown in equation 9, and predicts the probability of the categorical dependent (target) variable [37]. The estimated probability of the target variable in LR varies from 0 to 1. A threshold is set, and depending on this threshold, the estimated probability is used to assign an instance to a specific target class. The estimated value for a given $x_i$ can be interpreted as the chance that sample $x_i$ is a member of the target class. For example, if the predicted value of a sample $x_1$ is >0.5, the sample is classified under the ''CDR > 0'' category, otherwise under the ''CDR = 0'' category. ''Equations 12, 13, and 14'' are the main equations of the LR algorithm. This study has a binary dependent variable with two categories, namely the CDR > 0 and CDR = 0 groups. Here, Y signifies the dependent target variable (the CDR > 0 group), while X in equation 11 represents the independent explanatory variables in the dataset. Every independent variable X is assigned a coefficient value β representing its weight. Different weights represent the different correlations between the variables X and Y.
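The standard form of the logistic regression model described here is sketched below, using Y, X, and β as in the text. The number of explanatory variables k and the intercept β0 are assumed notation, and the layout may differ from the original equations 11-14.

```latex
% Linear predictor
z = \beta_0 + \sum_{j=1}^{k} \beta_j X_j

% Logistic (sigmoid) link: estimated probability of the CDR > 0 class
P(Y = 1 \mid X) = \frac{1}{1 + e^{-z}}

% Equivalent log-odds form
\ln \frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = z

% Decision rule at a threshold of 0.5
\hat{y} =
\begin{cases}
  1 \;(\text{CDR} > 0) & \text{if } P(Y = 1 \mid X) > 0.5 \\
  0 \;(\text{CDR} = 0) & \text{otherwise}
\end{cases}
```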

3) DECISION TREE (DT)
The DT is a supervised learning algorithm that can be applied to both classification and regression problems. A DT algorithm builds a model to predict the target variable's class label or value, and can be used for both categorical and continuous target variables. The DT classifies the instances in the dataset by sorting them down the tree from the root node to some leaf node, with the leaf node providing the decision for labeling test instances whose class is unknown. The root node of a DT is the best performing predictor among the attributes present in the dataset. A typical DT structure is shown in Fig. 6. There are various ways to select the root node, based on the percentage of heterogeneity obtained by splitting the data using different attributes.
Impurity measures, namely the Gini index, entropy, and classification error, are calculated for each predictor, and a comparison is made to select the best predictor as the root node from which to start splitting the dataset. The DT algorithm recursively generates a split (decision node) on each subset of data, considering attributes that have not been selected before, until it reaches a terminal node (leaf node) corresponding to a subset of data with maximum purity [38].
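The impurity measures used to compare candidate splits can be sketched in a few lines of pure Python. The function names and the toy labels are illustrative; the formulas are the standard Gini index and Shannon entropy.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def weighted_impurity(left, right, impurity):
    """Impurity of a binary split, weighted by the size of each child."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Toy example: a perfectly mixed node and a split that separates it cleanly.
parent = [0, 0, 1, 1]
print(gini(parent), entropy(parent))
print(weighted_impurity([0, 0], [1, 1], gini))
```

The attribute whose split gives the lowest weighted impurity is the natural candidate for the root node.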

4) RANDOM FOREST (RF)
The RF is an ensemble learning algorithm that combines relatively uncorrelated tree predictors. Each tree in the RF provides a class prediction, and finally, the most voted class is the final prediction result (class) of the ensemble. A pictorial representation of the working of the algorithm is depicted in Fig. 7.
RF reduces variance and overfitting by averaging the results of the individual tree predictors [39].

D. ENSEMBLE MODEL GENERATION
Many methods generate an ensemble of models, including stacking, bagging, boosting, voting, and averaging. As the dataset is small and an ensemble of strong learners is required, methods that minimize variance, such as voting and averaging, need to be selected. Since averaging outperforms voting [40], the trained LR and SVM classifiers were selected for the average ensemble, while the DT and RF classifiers were left out because they were outperformed by the former two.
The idea behind an average ensemble is that among two or more models, some might perform better on certain samples while others might perform better on the remaining ones, and by taking their average, a single classifier is built that can generalize to all samples. This can be checked empirically using the PCC of the models' outputs. Lower correlation signifies that the models perform well on different subsets of the data, while higher correlation leads to an ensemble that is only as good as its base predictors. Giving all models equal weight in an average ensemble reduces variance but increases bias, as poorly performing models can have a regularization-like effect on the output. A weighted average ensemble reduces variance without affecting the system's bias by giving more importance (weight) to the better performing models in the ensemble.

1) WEIGHTED AVERAGE ENSEMBLE LEARNING
The key idea behind ensemble learning is to combine weak learning classifiers to generate a robust classifier. The final ensemble-based model provides better stability by reducing the error (bias, noise, and variance) of the individual weaker classifiers [40]. In this study, an ensemble-learning model was created by training-cum-cross-validating on two hundred copies of the pruned dataset (i.e., with three features), and within each copy, a random train-test split of 80-20 was performed. Two hundred copies of the MCI dataset were generated to average out the results and obtain a more realistic estimate for the small dataset. All training and testing datasets were reduced to three features. Five-fold cross-validation was performed over all 200 training datasets, and the final results were averaged to find the best classifier. An ensemble of the two best-performing algorithms was created by taking a weighted average of LR and linear SVM, where the SVM's weight was set to 0.9 and that of logistic regression was set to 0.1.
Because the output of the linear SVM is not a probability, its soft output was converted to a probability via the logistic function. Refer to equation ''15.''
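The weighted averaging step can be sketched as below, with the SVM decision values passed through the logistic function before averaging. The weights 0.9 and 0.1 come from the text; the function names and the example scores are hypothetical, and applying an unscaled logistic function to the raw decision value is an assumption about how the conversion was done.

```python
import numpy as np


def sigmoid(score):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(score, dtype=float)))


def weighted_ensemble(svm_scores, lr_probs, w_svm=0.9, w_lr=0.1):
    """Weighted average of SVM pseudo-probabilities and LR probabilities."""
    svm_probs = sigmoid(svm_scores)
    return w_svm * svm_probs + w_lr * np.asarray(lr_probs, dtype=float)


# Hypothetical outputs for three samples: SVM decision values and LR probabilities.
probs = weighted_ensemble([0.0, 2.0, -2.0], [0.5, 0.9, 0.2])
print(probs)
```

With scikit-learn models, the SVM scores would typically come from `decision_function` and the LR probabilities from `predict_proba`.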

E. TRAINING-CUM-VALIDATION DATASET
The MCI dataset with 333 instances was segmented into 80% training data and 20% independent test data. The 80% portion was used for 5-fold training-cum-cross-validation. The four supervised machine-learning algorithms (LibSVM, LR, Decision Tree (DT), and Random Forest (RF)) were used as cost-sensitive estimators (classifiers) with weights of 0.2 for class ''0'' and 0.8 for class ''1'' to neutralize the bias created by the majority class in the MCI dataset, such that the false positive rate would not exceed a threshold of 20%. The cost-sensitive classifiers were trained-cum-validated on the various feature subsets obtained from the RFE and L1 regularization feature selection methods.
The total number of instances in the training data was 266, of which 196 were cognitively normal samples (negative class) and 70 were MCI patients (positive class). Since binary classification is being performed, we need to set an optimum threshold for our base and ensemble classifiers. A simple binary search was employed with the initial lower, intermediate, and upper values set to 0, 0.5, and 1.0, respectively. The objective was to find the threshold at which specificity matched that reported by Craig-Schapiro et al., 2012 [11]. Therefore, for each value of the middle variable, the model was evaluated 200 times under 5-fold cross-validation, and the average results were used to check the specificity value. This process was repeated until specificity was consistent with Craig-Schapiro [11]. The performance of the trained-cum-validated prediction models was evaluated by comparing their sensitivity, accuracy, area under the Precision-Recall curve (PR-AUC), and ROC-AUC values at a specificity in line with the earlier study conducted by Craig-Schapiro [11].
Finally, after calibrating the model(s) using the best threshold value, we performed predictions on the test dataset. Other models were also trained and tested at our estimated optimum threshold, and a comparative assessment was made between our results and those of the model developed by Craig-Schapiro et al., 2012 [11].
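The binary search over the decision threshold can be sketched as follows. The target specificity of 0.80, the tolerance, and the toy scores are assumptions for illustration; in the study, each candidate threshold was scored by averaging over 200 repetitions of 5-fold cross-validation rather than on a single set of predictions.

```python
import numpy as np


def specificity_at(threshold, y_true, y_prob):
    """Fraction of true negatives among all negative samples at a threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    negatives = y_true == 0
    return float(np.mean(y_pred[negatives] == 0))


def find_threshold(y_true, y_prob, target, tol=0.01, max_iter=30):
    """Binary search for a threshold whose specificity is close to the target."""
    low, high = 0.0, 1.0
    mid = 0.5
    for _ in range(max_iter):
        mid = (low + high) / 2.0
        spec = specificity_at(mid, y_true, y_prob)
        if abs(spec - target) <= tol:
            break
        if spec < target:
            low = mid   # too many false positives: raise the threshold
        else:
            high = mid  # specificity above target: lower the threshold
    return mid


# Toy predictions: ten negatives followed by five positives.
y_true = [0] * 10 + [1] * 5
y_prob = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95,
          0.60, 0.70, 0.80, 0.90, 0.99]
threshold = find_threshold(y_true, y_prob, target=0.80)
print(threshold, specificity_at(threshold, y_true, y_prob))
```

The search relies on specificity being non-decreasing as the threshold rises.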

F. INDEPENDENT TEST DATASET
In the current study, the trained-cum-validated models generated using the various subsets of features obtained from the RFE and L1 regularization methods were reevaluated using five-fold cross-validation on the 200 copies of the 20% independent test data. The total number of instances in the test data was 67, of which 21 belonged to the positive class (MCI patients) and 46 to the negative class (cognitively normal subjects). Testing on independent data is recommended to eliminate predictive modeling bias. Five-fold cross-validation was performed over all 200 copies of the 20% independent test dataset, and the final ensemble model was created by taking a weighted average of LR and linear SVM.

G. MODEL EVALUATION METRICS

1) ACCURACY (ACC) AND CONFUSION MATRIX
The ACC is calculated as the total number of correct predictions (TP + TN) divided by the total number of instances in a dataset ((Positive + Negative) instances). Refer to equation ''16'':
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{16}$$
An accuracy of 1.0 is considered the best, while 0.0 is considered the worst. Classification accuracy is an excellent statistical evaluator only when the positive and negative instances in the dataset are equal in number, i.e., the classes are balanced. If the classes are imbalanced, classification accuracy gives a false sense of the correctness of a classification algorithm (classifier) [41]. Under such circumstances, calculating a confusion matrix [42], [43] can give us insight into the errors, and the types of errors, that our classification model is making. It also tells us which predictions the model is getting right. This breakdown of correct and incorrect predictions against the actual outcomes overcomes the limitations of using accuracy alone as a statistical evaluator of a classification model's performance. Besides, if one wishes to avoid false positives more than false negatives, or vice versa, other statistical evaluators, namely Sensitivity (Recall or True Positive Rate (TPR)) and Specificity, are more informative than accuracy [41].

2) SENSITIVITY
The equation for determining sensitivity is depicted below. Refer to ''17'':
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{17}$$

3) SPECIFICITY
Specificity (SP) is calculated as the number of correctly classified negative instances divided by the total number of negative samples in a dataset. The best specificity value is 1.0, and the worst is 0.0.
The equation for calculating specificity is depicted below. Refer to ''18'':
$$\mathrm{SP} = \frac{TN}{TN + FP} \tag{18}$$
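Accuracy, sensitivity, and specificity can all be computed from the four confusion-matrix counts, as in this small sketch. The helper names and the toy labels are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Toy example: three positives and five negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
```

In this example the accuracy (0.75) hides the fact that one in three positive cases is missed, which is why sensitivity and specificity are reported alongside it.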

4) RECEIVER OPERATING CHARACTERISTIC CURVE (ROC)
The ROC is a curve generated by plotting the TPR (Sensitivity/Recall) against the False Positive Rate (FPR) (1 − Specificity) at various classification or decision thresholds. The area under the ROC curve (ROC-AUC) measures a classification model's ability to distinguish between the two classes in a dataset. A random classifier has an AUC value of 0.5, and a perfect classifier that correctly discriminates the two classes in a dataset has an AUC value of 1.

5) PRECISION-RECALL (PR) CURVES
Similar to the closely related ROC curves, PR curves are an evaluation tool for binary classification that helps us visualize the performance of machine learning algorithms at several classification thresholds ranging between 0 and 1. PR curves are used particularly for imbalanced or skewed datasets, where instances of one class are observed far more often than instances of the other class. On such skewed (imbalanced) datasets, PR curves are a suitable alternative to ROC curves because they can reveal large performance differences between two algorithms that ROC curves do not represent appropriately [44]. In addition to visual evaluation of a PR curve, the Area under the PR curve (AUPR) is often used for assessing the performance of a machine learning algorithm, regardless of any particular operating point or threshold. The higher the AUPR value, the better the model predicts positive instances as positives (True Positives (TP)). AUPR values range from 0 to 1, with a score of 1 depicting a perfect model, i.e., a model that predicts all positive instances in the dataset as True Positives with no false-negative or false-positive predictions.
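One common way to summarize a PR curve by a single number is the average precision, which sums precision over the thresholds weighted by the increase in recall. The sketch below uses that estimator as an assumption; the study does not state which AUPR estimator was used, and the function name and toy data are illustrative.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive instance is required")
    ap = 0.0
    prev_recall = 0.0
    # Lower the threshold one distinct score at a time.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        predicted_pos = sum(1 for s in scores if s >= thr)
        precision = tp / predicted_pos
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical labels and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap_mixed = average_precision(y_true, scores)

# A ranking that places every positive above every negative.
ap_perfect = average_precision([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print(ap_mixed, ap_perfect)
```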

H. STUDENT T-TEST
A two-tailed unpaired Student t-test [45] was performed to test for a significant difference (i.e., p-value < 0.05) between the mean values of the three features (Cystatin C, MMP10, and tau proteins) across the two populations (CDR 0 and CDR > 0). Likewise, a one-tailed unpaired t-test [45] was performed at a significance level of 0.05 to show that our model's performance is better than that of the models based on traditional biomarkers and on the features that achieved the best ROC_AUC result in Craig-Schapiro et al., 2012 [11].
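The test statistic behind an unpaired Student t-test can be sketched with the Python standard library as shown below. The sketch assumes equal variances in the two groups (the pooled-variance form), which the text does not state explicitly, and the sample values are made up. Converting the statistic to a p-value requires the t distribution, for which a statistics package such as SciPy (`scipy.stats.ttest_ind`) would typically be used.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Unpaired Student t statistic and degrees of freedom (equal variances assumed)."""
    n1, n2 = len(a), len(b)
    mean1, mean2 = statistics.mean(a), statistics.mean(b)
    var1, var2 = statistics.variance(a), statistics.variance(b)  # sample variances
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_err, df


# Hypothetical biomarker levels in two groups (not data from the study).
group_cdr0 = [1.0, 2.0, 3.0, 4.0, 5.0]
group_cdr_gt0 = [2.0, 3.0, 4.0, 5.0, 6.0]

t_stat, df = pooled_t_statistic(group_cdr0, group_cdr_gt0)
print(t_stat, df)
```

For a two-tailed test, the p-value is the probability of observing a statistic at least as extreme as |t| in either direction; for a one-tailed test, only the direction of the stated hypothesis is counted.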

I. IMPLEMENTATION: WEB APPLICATION
The ensemble-learning based predictive model was developed using the pruned MCI dataset with three features.
The model was saved in the Python pickle file format (.pk). Flask, a lightweight micro web framework with Jinja2 templating, was used to develop the web application. The Flask web application was then deployed on Heroku, a cloud platform for building and running web applications. The output of the application is probability-based: individuals with an output probability score greater than 0.5 have a higher probability of being diagnosed with AD than individuals with a score less than or equal to 0.5.
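The save, reload, and score-interpretation steps described above can be sketched as follows. This is a minimal illustration and not the deployed application: the "model" is a hypothetical dictionary of logistic-regression-style coefficients with made-up values, the feature keys and function names are assumptions, and the Flask routing layer is omitted. Serialization is done in memory here, whereas the study saved the model to a .pk file.

```python
import math
import pickle

# Hypothetical stand-in for the trained model (coefficients are made up).
model = {
    "features": ["cystatin_c", "mmp10", "tau"],
    "weights": [0.8, 0.5, 1.2],
    "intercept": -0.3,
}

# Serialize and restore the model with pickle.
blob = pickle.dumps(model)
restored = pickle.loads(blob)


def predict_probability(model, sample):
    """Logistic score for one sample given as a feature-name -> value mapping."""
    z = model["intercept"] + sum(
        w * sample[name] for w, name in zip(model["weights"], model["features"])
    )
    return 1.0 / (1.0 + math.exp(-z))


def interpret(probability, cutoff=0.5):
    """Map the probability score to the two outcomes described in the text."""
    if probability > cutoff:
        return "higher probability of AD"
    return "lower probability of AD"


# Hypothetical standardized biomarker values for one individual.
sample = {"cystatin_c": 0.0, "mmp10": 0.0, "tau": 0.0}
probability = predict_probability(restored, sample)
print(round(probability, 4), interpret(probability))
```

Only pickle files from a trusted source should be loaded, since unpickling can execute arbitrary code.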

A. BEST INFORMATIVE FEATURES SELECTION AND CLASSIFIER EVALUATION
RFE with class-weighted, cost-sensitive classifiers was used to select the most informative set of features for model building. The feature subsets obtained using the RFE-based FS method are tabulated in Table 1. The comparative performance of four cost-sensitive classifiers (RF, LR, LibSVM, and DT), trained and tested on the feature sets listed in Table 1, is shown in Fig. 8(a-d). Fivefold cross-validation was performed over all 200 training datasets, and the final results were averaged to find the best classifier and the best subset of features. As shown in Fig. 8(a-d), the LR and linear SVM models built using a subset of three features (Cystatin C, MMP10, and tau proteins) performed well in discriminating mildly cognitively impaired (CDR > 0) from cognitively normal subjects, with better sensitivity, PR-AUC, and ROC-AUC at a specificity in line with Craig-Schapiro et al., 2012 [11]. Further, a comparative average performance evaluation of the RFE-classifier based models built using the three most informative features in discriminating cognitively normal (CDR 0) from cognitively impaired (CDR > 0) subjects is shown in Table 2.
The probability distributions of the three proteins (Cystatin C, MMP10, and tau proteins) in the two populations (CDR 0 and CDR > 0) are shown as histogram plots in Fig. 9(a-c). The p-values of the two-tailed unpaired t-test for Cystatin C, MMP10, and tau proteins, selected using the LinearSVM classifier based RFE method, are tabulated in Table 3.
The observed p-values of the three selected protein biomarkers were lower than the significance level of 0.05. Therefore, the difference in the mean levels of the selected attributes between the cognitively impaired and cognitively normal groups is significant. The LASSO algorithm was additionally applied for parameter estimation and feature selection. The selected subset of features, with their coefficients and Mean Square Error (MSE) along with the tuned Lambda hyperparameter value, is tabulated in Table 4.
The feature importance of the optimal set of features selected using LASSO is represented in Fig. 10. Models based on four cost-sensitive classifiers (RF, LR, LibSVM, and DT) and built using the optimal subset of features (GRO_alpha, tau, Cystatin_c, VEGF, and Aβ42) obtained via L1 regularization were evaluated on the following performance measures: Accuracy, ROC-AUC, PR_AUC, and sensitivity.
The comparative performance evaluation of the four classifier-based models is represented in Fig. 11. We can observe from Fig. 11 that the cost-sensitive LR classifier based model, with an accuracy of 0.8864 ± 0.098, sensitivity of 0.8663 ± 0.048, PR_AUC of 0.8266 ± 0.081, and ROC_AUC of 0.8966 ± 0.054, outperforms the other classifier-based models. The comparative performance evaluation of the two best models generated using RFE and L1 regularization at a 5% significance level is tabulated in Table 5.
It can be observed from Table 5 that the Linear SVM-RFE based model built using a subset of three features (Cystatin_C, MMP10, and Tau) performed significantly better, at a 5% significance level, than the LR-based model built using the subset of features generated using L1 regularization. Finally, the MCI test dataset was pruned based on the selected subset of informative features (Cystatin C, MMP10, and tau proteins).

B. ENSEMBLE-LEARNING MODEL BUILDING AND OPTIMIZATION OF THRESHOLD
The PCC score between the outputs of LR and SVM was found to be 0.63, indicating that the two models are not perfectly correlated and that an averaging ensemble could therefore yield a predictor that is better than the individual models. A weighted average of the soft outputs of LR and Linear SVM obtained from the classifier-RFE method was used to create the ensemble. Under 5-fold cross-validation over all 200 training datasets, the ensemble model's weighted-average results show that a threshold value of 0.321 is optimal for correctly classifying instances in the dataset. Finally, our calibrated ensemble model was tested on the 200 copies of the 20% independent test datasets using the optimum threshold value, and the averaged results for each class are shown in Table 6. Considering the tested model's overall predictions, the weighted average ensemble model's accuracy was 0.9552 ± 0.025.
The performance of the ensemble model in terms of the confusion matrix is shown in Fig. 12. A comparative evaluation of the ensemble model with the base classifiers is tabulated in Table 7.
The sensitivity, PR_AUC, and ROC_AUC values of the ensemble model were observed to be considerably better than those of the individual best-performing classifiers (Linear SVM and LR) at a 5% significance level, as shown in Table 7.
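The ensemble construction described in this subsection (correlation check between the base models, weighted averaging of their soft outputs, and selection of a decision threshold) can be sketched as follows. This is an illustrative sketch under stated assumptions: the weights, the toy scores, and the function names are hypothetical, and the threshold is chosen by maximizing Youden's J (sensitivity + specificity − 1), since the criterion used in the study is not specified here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def weighted_average(p_lr, p_svm, w_lr, w_svm):
    """Weighted average of the two models' soft outputs."""
    total = w_lr + w_svm
    return [(w_lr * a + w_svm * b) / total for a, b in zip(p_lr, p_svm)]


def youden_j(y_true, scores, threshold):
    """Sensitivity + specificity - 1 when predicting 1 for scores >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def best_threshold(y_true, scores, candidates):
    """Candidate threshold with the highest Youden's J (first one on ties)."""
    return max(candidates, key=lambda thr: youden_j(y_true, scores, thr))


# Hypothetical validation labels and soft outputs of the two base models.
y_true = [0, 0, 1, 1]
p_lr = [0.125, 0.375, 0.625, 0.875]
p_svm = [0.25, 0.25, 0.75, 0.75]

r = pearson(p_lr, p_svm)
ensemble = weighted_average(p_lr, p_svm, w_lr=0.5, w_svm=0.5)
threshold = best_threshold(y_true, ensemble, [i / 100 for i in range(1, 100)])
labels = [1 if s >= threshold else 0 for s in ensemble]
print(round(r, 4), ensemble, threshold, labels)
```

In practice the threshold would be selected on the cross-validation folds and then applied unchanged to the held-out test data, as the text describes.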

C. COMPARATIVE PERFORMANCE EVALUATION
For brevity, our calibrated ensemble model was trained and tested on datasets generated using the state-of-the-art features (tau, p-tau, and Aβ_42) as well as the features that achieved the best ROC_AUC results for Craig-Schapiro et al., 2012 [11]. A comparative performance assessment of the models, as mentioned above, at a 5% significance level, is tabulated in Table 8.

TABLE 5.
Comparative independent two-sample t-test between the model built using SVM-RFE and the model built using features selected using the LASSO method at a significance level of 5%.
It can be observed from Table 8 that the calibrated weighted average ensemble model built using a subset of three features (Cystatin_C, MMP10, and Tau) performed significantly better than the model built using the state-of-the-art features (tau, p-tau, and Aβ_42). Additionally, our proposed ensemble model also performed significantly better, at a 5% significance level, than the model built using the combination of CSF protein biomarkers that achieved the best ROC_AUC results for Craig-Schapiro et al., 2012 [11].
Histogram plots of the ROC_AUC scores generated from each of the 200 runs approximately depict the difference in the probability distributions between our proposed CSF protein biomarkers and the models built using conventional CSF biomarkers and the best ROC_AUC biomarkers proposed by Craig-Schapiro et al., 2012; they are shown in Fig. 13 and Fig. 14, respectively.

TABLE 7.
Performance evaluation of the calibrated weighted-averaged ensemble model at a 5% significance level in discriminating cognitive normal (CDR 0) from cognitively impaired (CDR > 0) subjects.

TABLE 8.
Comparative averaged out performance evaluation of models at a 5% significance level in classifying cognitive normal (class 0) from cognitively impaired (class 1) participants.
The histogram plot is also used to estimate a statistically significant change between the models.
A one-tailed unpaired t-test was used to assess whether our model is better than the models generated using the traditional attributes and the attributes with the best ROC_AUC value from the study conducted by Craig-Schapiro et al., 2012. The p-values of the one-tailed unpaired t-test for the comparison of our proposed ensemble model with the models generated using traditional CSF protein biomarkers and using the biomarkers that achieved the higher ROC_AUC value in Craig-Schapiro et al., 2012 were observed to be 4.3002 × 10^-36 and 2.398 × 10^-38, respectively, both much lower than the significance level of 0.05.

D. PREDICTIVE WEB-APPLICATION IMPLEMENTATION
The web-based predictive application, based on a novel combination of three CSF protein biomarkers to predict the earlier stages of AD, has been made live on Heroku at https://appsalzheimer.herokuapp.com.

IV. DISCUSSION
The MCI dataset was downloaded from Figshare [30]. The clinical dataset consisted of 91 MCI subjects and 242 cognitively normal subjects. The subjects' cognitive impairment status in the MCI clinical dataset was assessed on the CDR scale ranging from 0 to 1, where CDR 0 denoted a cognitively normal subject, while CDR 0.5 and CDR 1 denoted very mildly and mildly cognitively impaired subjects, respectively. The raw clinical dataset consisted of 124 features encompassing CSF protein biomarkers, allelic variants of the Apolipoprotein E genotype (E2, E3, and E4), and demographic features, namely gender and age. All categorical features were transformed into a binary-encoded format using the one-hot-encoding technique. All duplicate instances were removed as well. This study focuses on two feature selection methods, namely the RFE method, an important example of wrapper-based feature selection, and the LASSO method, an embedded feature selection method. Since there is a class imbalance in the dataset, the RFE-based feature selection used cost-sensitive classifiers, namely RF, DT, LR, and LinearSVM, with a series of feature sets whose size ranged from 1 to (1 + n), where n varies from 0 to 4. Likewise, the number of attributes removed at each iteration in RFE was set to three.
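The one-hot encoding and class-imbalance handling mentioned above can be sketched in plain Python as follows. Both helpers are illustrative assumptions: the class weights follow the common "balanced" heuristic (total samples divided by the number of classes times the class count), which is one way of making a classifier cost-sensitive and is not confirmed as the weighting used in the study, and the class counts are the ones quoted in the text before duplicate removal.

```python
from collections import Counter


def one_hot(values):
    """Return the sorted categories and one binary indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical categorical column (allelic variant labels).
categories, encoded = one_hot(["E3", "E4", "E2", "E3"])

# Class counts quoted in the text: 91 MCI (1) and 242 cognitively normal (0).
labels = [1] * 91 + [0] * 242
weights = balanced_class_weights(labels)
print(categories, encoded, weights)
```

With these counts, errors on the minority (MCI) class are weighted roughly 2.7 times more heavily than errors on the majority class.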
A subset consisting of the three best-performing features, namely Cystatin C, MMP10, and tau, was generated by the cost-sensitive SVM-RFE and LR-RFE based feature selection process. The Linear SVM and LR-based models built using these three features outperformed the other classifier-based models built using the other feature subsets generated by the classifier based RFE method. The LR-based model built using the five most important features selected using L1 regularization performed better than the RF, DT, and Linear SVM-based models. However, the performance of the LR model built using features screened using L1 regularization was significantly (p < 0.05) lower than that of the LinearSVM based model built using the three most informative features screened using the classifier based RFE feature selection method. Furthermore, we observed a significant difference, at a significance level of 0.05, between the mean levels of all three protein biomarkers (Cystatin C, MMP10, and tau proteins) derived from the RFE method in subjects belonging to the cognitively normal (CDR 0) and mildly cognitively impaired (CDR > 0) groups.
The candidate protein biomarkers (Cystatin C, MMP10, and tau proteins) screened in this study belong to different pathways and functional groups whose association with AD pathophysiology has been investigated and documented, most notably tau protein, a traditional CSF biomarker that has proven useful in the diagnosis and prognosis of AD. In some individuals with very mild cognitive impairment, CSF tau levels have shown a gradual increase years before an ultimate diagnosis of AD [15]- [17]. Therefore, the tau protein has been useful for predicting AD's onset in individuals with very mild or mild cognitive disorders. Cystatin C has a pivotal role in AD pathophysiology, as CysC concentration modulates Aβ amyloidogenesis and oligomerization [46], [47]. MMPs play a vital role as an inflammatory element in AD's disordered physiological process. MMPs transcription is induced through posttranslational modification by the inflammatory mediators (e.g., free radicals or cytokines) and inhibitor proteins (e.g., Metalloproteinase inhibitors (TIMPs)). The activated MMPs remodel the pericellular environment by regulating the extracellular matrix (ECM) and the breakdown of tight junctions. The MMPs also interact with and alter the properties of growth factors, cell surface components, and signaling molecules, leading to neuroinflammation, cell death, and neurotoxicity [48]. In this context, separate studies conducted by Duits et al., 2015 [49] and Whelan et al., 2019 [50] showed a significant increase in MMP-10 in AD-dementia and Aβ+ MCI patients as compared to cognitively normal individuals. The altered level of MMP-10 in AD and MCI patients suggests the involvement of MMP-10 protein in the pathology of AD and, thus, the possible use of MMP-10 as a protein biomarker for earlier diagnosis of AD. The use of the screened CSF biomarkers, whose association with the pathology of AD is well studied, may therefore improve the reliability of models built to predict AD's earlier stages.
The classifiers' comparative performance evaluation using different subsets of features obtained using the RFE and LASSO methods generated consistently accurate models. However, the RFE-based LR and Linear SVM models achieved the best results, with a subset consisting of the three most informative features. Therefore, an ensemble model of the two best-performing classifiers was generated. Since the ensemble is probabilistic by nature, it requires a threshold for producing binary output. Thus, the ensemble was calibrated to obtain an optimum threshold for predicting the earlier stages of AD. Our ensemble model's weighted-average results were significantly better than the individual performance of the LR and Linear SVM base classifiers tested on the MCI dataset in terms of sensitivity, ROC_AUC, and PR_AUC values. The weighted-average result of our ensemble model was compared, at a 5% significance level, to models generated using the traditional combination of CSF protein biomarkers (Tau, p-tau, and Aβ-42) and using the biomarkers that achieved the best ROC_AUC result for Craig-Schapiro et al., 2012 [11].
The ensemble model based on our novel combination of CSF protein biomarkers performed significantly better in classifying cognitively impaired subjects (positive instances) from the MCI dataset than models generated using the state-of-the-art and best-performing protein biomarkers obtained from the study conducted by Craig-Schapiro et al., 2012 [11]. Lastly, recent studies by Spellman et al. [51] and Llano et al. [52] examined a clinical Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and proposed a peptide signature analyte-based multivariate model to predict the earlier stages of AD. These studies found several overlapping potential signature peptides, namely Aldolase A, Fructose-Bisphosphate peptide (ALDOA), Fatty Acid-Binding Protein, Heart (FABPH), Neuronal pentraxin receptor (NPTXR), Peroxiredoxin-1 (PRDX1), and Neurosecretory protein VGF (VGF), for predicting the earlier stages of AD and future disease progression. The multivariate modeling approaches proposed by Spellman et al. [51] and Llano et al. [52] were able to differentiate AD from non-AD subjects with a ROC_AUC of 0.74 and 0.89, respectively. In comparison, our ensemble model, with a ROC_AUC value of 0.9499 ± 0.055, performed better in classifying cognitively impaired subjects from cognitively normal subjects than the traditional combination of CSF biomarkers and the signature peptides from the Spellman et al. [51] and Llano et al. [52] studies.
Additionally, our ensemble model has shown considerably better ROC_AUC performance than past multiple-marker studies involving multiple imaging modalities, neuropsychological testing, and APO_E genotype to predict the earlier stages of AD [24], [53]. Considering the higher AUPR and ROC_AUC attained by our model, profiling of our novel combination of CSF proteins can be recommended for clinical tests for predicting the earlier stages of AD. Our web-based application to predict the earlier stages of AD has been successfully implemented and made live on Heroku at https://appsalzheimer.herokuapp.com.

V. CONCLUSION AND FUTURE SCOPE
The proposed model uses weighted average ensemble-based learning to build a predictive model that can discriminate MCI subjects from healthy subjects with higher sensitivity, ROC_AUC, and PR_AUC values. In contrast, multiple-marker studies are costly, and obtaining all of their biomarkers from a single patient is impractical. Thus, the unique benefit of our ensemble approach is its cost-effectiveness, as profiling our novel combination of CSF proteins can precisely classify patients with earlier stages of AD. We also built a live web-based predictive system using this novel combination of CSF protein biomarkers, which is the first online service of its kind for predicting the earlier stages of AD. We believe that our application, built using the most informative novel combination of features, will benefit researchers and doctors in diagnosing very mild or mild CI disorders, thereby assisting AD's earlier prediction.

In 2008, he started his career as a Software Developer with Sutraa Pvt., Ltd., Delhi, India. He is currently serving as a Lecturer for the Faculty of Computing and Information Technology, King Abdulaziz University, Rabigh, Saudi Arabia. He is also an Excellent Teacher and a Talented Researcher with more than seven years of teaching and research experience in machine learning, bioinformatics, Web technology, and image processing. He has produced many publications in the Journal of International Repute and presented articles at International conferences. His current research interests include deep learning medical informatics and machine learning.
Mr. Khan is a member of the International Association of Engineers (IAENG) and a member of the following societies, The IAENG Society of Bioinformatics, The IAENG Society of Computer Science, and The IAENG Society of Data Mining.
ATIF HASSAN received the master's degree in computer science and engineering from the Indian Institute of Technology Kharagpur, where he is currently pursuing the Ph.D. degree from the Centre of Excellence in Artificial Intelligence. His research interests include natural language processing, data mining, machine learning, and deep learning in general. He loves sinking his teeth into new, real-world problems, and regularly contributes to the AI community through blogs, research, and python package releases. He is also very active in ML competitions, securing top ten ranks on numerous occasions. In his free time, he works on new ideas and helps all those who reach out to him for guidance in the fields of ML, NLP, and DL.

In 2009, he started his academic career as a Lecturer with the Faculty of Computing and Information Technology-Rabigh (FCITR), King Abdulaziz University, Saudi Arabia, where he is currently serving as an Assistant Professor. He is also an Excellent Teacher and a Talented Researcher with more than 11 years of teaching and research experience in database, computer science, artificial intelligence, and machine learning. He has produced many publications in the Journal of International Repute and presented articles at International conferences. His main research interests include database analysis, design and modeling, temporal database models, temporal data mining, time-varying medical data, and image encryption using chaos encryption scheme. He has authored many research articles in data modeling of time-varying data and image encryption in international journals and conferences.
MUHAMMAD BINSAWAD received the master's degree in applied information technology and the post-baccalaureate certificate in information systems management from Towson University, USA, and the Ph.D. degree in information systems from the University of Technology Sydney (UTS), in 2019.
He has professional Information Systems Design, Digital Transformation, Business Analysis, IT Project Management, and Solid knowledge in the SDLC approach. He has taught several subjects in information systems and software engineering at King Abdulaziz University and the University of Technology in Sydney. He is currently an Assistant Professor with the Department of Computer Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University. His research interests include and are not limited to information systems modeling-services, digital transformation human-computer interaction (HCI), and empirical studies. He is also actively involved in international research and events activities and contributing to international conferences and journals.
ALHUSEEN OMAR ALSAYED received the master's degree in information technology from the Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia. He is currently working as a Scientific Researcher with the Deanship of Scientific Research, King Abdulaziz University, Jeddah, Saudi Arabia. He is also a certified Trainer and a Professional Practitioner from KAU and Technical and Vocational Training Corporation, Saudi Arabia. His research interests include E-learning, cloud-based E-learning, collaborative learning, social networking sites, machine learning, and other related topics. He has published many articles in refereed/indexed international journals and conferences /MDPI/ IEEE ACCESS and Taylor & Francis Group. He has been appointed as a Reviewer of IEOM GCC Conference and IEEE. He is also actively involved in international research and events activities and contributing to international conferences.