Optimized One vs One Approach in Multiclass Classification for Early Alzheimer’s Disease and Mild Cognitive Impairment Diagnosis

The detection of Alzheimer’s Disease in its early stages is crucial for patient care and drugs development. Motivated by this fact, the neuroimaging community has extensively applied machine learning techniques to the early diagnosis problem with promising results. The organization of challenges has helped the community to address different raised problems and to standardize the approaches to the problem. In this work we use the data from international challenge for automated prediction of MCI from MRI data to address the multiclass classification problem. We propose a novel multiclass classification approach that addresses the outlier detection problem, uses pairwise t-test feature selection, project the selected features onto a Partial-Least-Squares multiclass subspace, and applies one-versus-one error correction output codes classification. The proposed method yields to an accuracy of 67% in the multiclass classification, outperforming all the proposals of the competition.


I. INTRODUCTION
Assistance in diagnosis of Alzheimer's Disease (AD) in it's early stages has received a constant interest from the medical imaging community in the past decades due to its medical importance and societal implications [8]. Distinguishing between AD and its related neurological disorders, including its prodromal stage Mild Cognitive Impairment (MCI), is very challenging from the clinical evaluation point of view, predominantly in the early stages of the disease. Medical imaging techniques, such as Magnetic Resonance Imaging (MRI) or Positron Emission Tomography (PET), have provided new tools to assess the subject conditions in a non-invasive way. However, both the subtle changes produced in the brain and the lack of a complete understanding of the disease development still pose challenges to the The associate editor coordinating the review of this manuscript and approving it for publication was Yudong Zhang . diagnosis assistance through brain images [45], [47]. Concretely, the discrimination between MCI and AD has been shown to be a difficult task [24], [26], [30], [41], [42]. Machine learning applications in neuroimaging have become an indispensable tool for brain image analysis and computer aided diagnosis (CAD) systems, producing a prolific area of research [8]. However, the lack of standardized datasets hinders direct comparisons of approaches, and the identification of their virtues.
The open data policy and the creation of big databases have facilitated the organization of competitions for improving the CAD systems for AD diagnosis, such as CADDementia [3] and TADPOLE (https://tadpole.grand-challenge.org/), among others. The availability of the data for posterior analysis allows for retrospective analysis of the competitions and new submission proposals. It also facilitates the reproducibility of results, a problem that is becoming of central interest in neuroimaging research [4]. VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This work makes use of the International challenge for automated prediction of MCI data. The objective of the competition was the development of CAD systems for the multiclass classification of 4 classes: Healthy Controls (HC), Mild Cognitive Impaired (MCI) subjects, Mild cognitive impaired subjects that converted into Alzheimer's Disease during the study (cMCI), and Alzheimer's Disease (AD) patients. The challenge provided with preprocessed MRI data of the different classes to allow participant proposals of optimized CAD systems, based on the finding that combining multiple anatomical measures improves classification of early diagnosis of AD [44]. The results of the challenge were published in the special issue [38] on the Journal of Neuroscience Methods. The winner proposal used a random forest ensemble with feature extraction methods [10], yielding to a 61 % accuracy in the multiclass classification problem.
Ensemble methods have been successfully applied to neuroimaging problem [7], [20], [32], [33]. It has also been proven that multiclass approaches using binary classifiers can be a competitive solution to the problem, such as those based on one-versus-one approaches, one-versus-rest in the context of Error Output Correcting codes (ECOC) [6], [9], [13], [46]. This study uses and compares ensemble methods and aggregation methods by binary classifiers, as those of highest rated approaches in the challenge [5].
To optimize the multiclass classification through combination of anatomical features, not only classifier aggregations are necessary, but also feature extraction techniques [36]. Different approaches to feature selection and extraction reported high relevance in the literature, with high accuracy and also a correspondence between automatic cortical and subcortical region selection and clinical findings [7]. Concretely, brain atrophy has been found to be relevant for AD diagnosis in white matter cortical and subcortical regions, as well as hippocampal volume, cortical thickness, and grey matter density, thus making feature extraction a reasonable preprocessing step (see [44] and references therein). One of such successful methods for feature selection and combination is Partial Least Squares, linearly transforming the data into a space maximal separation between classes [25], [35], [39], [40].
Through the extensive use of feature extraction and ione-vs-one feature selection and classification, we propose and study a CAD system for identification of early stages of AD and MCI that optimizes the combination multiple anatomical measures of atrophy in the brain to improve classification performance.

II. METHODS
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. For up-to-date information, see www.adni-info.org.

A. DATASETS
This section shows the datasets that were provided for the International challenge for automated prediction of MCI from MRI data (https://inclass.kaggle.com/c/mci-prediction). MRIs were selected from the ADNI and preprocessed by Freesurfer (v5.3) [14], [15]. In total 429 demographical, clinical as well as cortical and subcortical MRI features were available for each subject. Fig. 2 shows the average values for different regions across the brain for the four available classes.
Two different datasets were provided for training and testing the proposed methods for automated prediction of MCI from MRI data. According to their diagnosis, patients were grouped into four classes: healthy control (HC) subjects, AD patients, MCI subjects whose diagnosis did not change in the follow-up (MCI) and converter MCI (cMCI) subjects that progressed from MCI to AD in the follow-up of the disease. The training dataset consisted of 240 ADNI real subjects (60 HC, 60 MCI, 60 cMCI and 60 AD). Demographic information is shown in Tables 1 and 2. The testing dataset consisted of 500 subjects. 160 out of them were real subjects, whereas the 340 remaining subjects were artificially generated from the real data. Table 2 shows demographic information of only the 160 real patients excluding 340 dummy subjects in the testing dataset. No information about the class labels of the test set was available during the competition. The test set was half split into public and private test sets and only the accuracy score on the public dataset was available for competitors until the challenge ended. Once the challenge finished, class labels for the subjects on the test set were provided to the competitors. The accuracy score on the real subjects of the testing set was used as the figure of merit in the competition.

B. WORKFLOW
The methodology followed in this work aims at optimizing the binary classification of the different classes, HC, MCI, cMCI and AD in the multiclass classification problem, so that the overall classification performance is increased. To that aim, the process is divided in four steps as depicted in Fig. 1. A first preprocessing step is applied to discard outliers and standardize the data. Secondly, based on the observation of the existence of irrelevant or redundant data, a filter is applied VOLUME 8, 2020 to eliminate unimportant features. Once a set of features is selected, a combination of statistical tests and Partial Least Squares (PLS) techniques are used to extract features at binary level for each one vs. one classification (HC vs MCI, HC vs. cMCI, etc..). Finally, binary classifiers are trained on these data, and an aggregation method is proposed for achieving a final multiclass decision.

1) PREPROCESSING
The presence of outliers is usually an undesirable source of instabilities for machine learning applications. In neuroimaging, outliers are specially challenging as they are frequently found due to acquisition, scanner differences, preprocessing artefacts or resulting from large intrinsic inter-subject variability, having a dramatic effect on the statistical based analysis [17].
A carefully analysis of the data reveals a high abundance of outliers on each of the 429 data features. In Fig. 3, the presence of outliers is depicted for selected features as measured in standard deviations, with values exceeding 8 times the standard deviation.
A common preprocessing step in machine learning consists of centering the data to zero mean and one standard deviation values, usually known as z-score values [19]. However, as Fig. 4a) shows, the z-score values of the data contain high salt-and-pepper type noise. We attributed this effect to a miss-transformation of the data format, coming from freesurfer software, as described by the challenge organizers. The outlier correction algorithm is described in box 1. The algorithm results are shown in Fig. 4b). Correcting this format defect reveals a different data structure, with high redundancy, justifying posterior steps. Concretely, the features sets 1-35, 45-73, 71-139, 140-277, 278-347, 348-413, 413-429, seem to contain very low inter-patient variability, suggesting that the feature space dimension can be highly reduced. In posterior sections, we show that feature space dimension can be optimally reduced to a value below 20.

2) FEATURE SELECTION AND EXTRACTION
The preprocessing step is followed by the elimination of irrelevant features and the extraction of features for classification. The former is a filter, in a one vs. one approach. The features are sorted according to a specific criteria, thus eliminating the features with the lowest relevance. The latter is achieved under a multiclass PLS transformation of the selected features, reducing the feature space dimension [21].
The sorting criteria is based on binary comparisons between classes: HC vs. MCI Replace outlier value by D(i, j) * 1000; end end end comparison a t-test is performed for each feature, and the features f i are sorted according to their value of the t statistic, t i , i = 1, 2, . . . , 429. A 6 × 429 matrix S of sorted features is generated in this process. From this matrix S, a submatrix T is constructed by eliminating the n last columns, ordered by decreasing value of the t statistic. The number of times m a feature appeared in the matrix T was calculated for each feature. The parameter m is a significance measure for each feature and is constrained: 0 ≤ m ≤ 6. All the features with a value of m under a fixed threshold R where filtered out, resulting in a feature selected set S containing the most relevant features for all the individual comparisons: The parameter n was fixed by cross validation, and the parameter R has six possible values, being R = 3 a reasonable compromise between very restrictive and non-existent filter.
The feature set S R selection is followed by a PLS-based feature extraction. Following the development presented in [37], we will consider the problem of modelling the relationship between two sets of data using PLS. Let X ∈ IR N and Y ∈ IR M be two multidimensional spaces of variables, PLS models the relationship between them by score vectors. After making n observations of each space, PLS decomposes the matrix X(n × N ) ∈ X of zero-mean variables and matrix Y(n × M ) ∈ Y of zero-mean variables as follows: where X is the training data matrix, Y is a labels matrix, and T and U are matrices (n × p) formed by the p score vectors extracted (components, latent vectors where Cov(t, u) = t t u/n denotes the covariance sample of the score vectors t and u. This last feature extraction step produces a transformed data matrix D t . The PLS transformation maximizes the separation between classes in the new space, and can also be used to reduce the feature space dimension by selecting a reduced number of PLS components.
Apart from this last technique, the use of an autoencoder [23] was tested for the same purpose as PLS, in order to use the data generated at the encoder output as low-dimensional data input in the classifier. However, its explanation will not be extended since it is not finally used in the CAD pipeline due to worse results than PLS for this dataset.

3) CLASSIFICATION
A simple solution to the multiclass classification problem is to build binary classifiers and combine them. Classical aggregation techniques of binary classifiers in multiclass problems are usually based on the error correcting output codes (ECOC). Given the multiclass classification problem on N classes, the simplest example is the one-vs-rest model, where the output code is generated by N binary classifiers that exhaust all possible one class versus the N − 1 rest of the classes classifications. After that, a decoding algorithm is used to assign a final class to each generated output code. Considering the output as a length N codeword, the decoding algorithm can be modelled as a communication problem, where the class information is being transmitted [13].
We consider here the following optimized approach to the multiclass classification: a binary classifier is trained on the K s = N (N − 1)/2 one-vs-one individual classification tasks using the transformed data matrix D t , producing a six-bit codeword output for each sample. This output is aggregated to produce a final prediction on the test sample, defined as a decoding process in ternary ECOC algorithms, taking values on the four possible classes: HC, MCI, cMCI and AD. The Hamming decoding [9] is used to map each possible codeword into a single output class as HD(x, y i ) = K s j=1 1/2(1 − sign(x j y j i )). The justification of this choice lies on the fact that the classes are nested. Concretely, the MCI class is considered as an early stage of AD, although not free of controversy [11], [31]. In any case, the class cMCI is an early stage of AD, and thus can be considered as a subclass of the AD class. For this reason, a ternary ECOC with three possible symbols allows for reduction of the non-relevant class influence in the codeword coding and decoding, and thus managing the possible errors arising from the difference on binary classification accuracies.
Different classifiers are used to perform the individual binary tasks: support vector machine (SVM), including the use of kernel methods [43], nearest neighbours (NN) and decision trees, using different ensemble techniques: bagging [1], boosting [16] and random forest [2]. Moreover, deep learning techniques are also used, such as multilayer perceptron (MLP) and convolution neural network (CNN) [28], for reference and comparison to other published results of the challenge [10], [34]. For evaluation purposes and following the results found in [22], an upper bound can be set on the actual risk based on the re-substitution estimation of the empirical error |P act (f (x)) − P emp (f (x))| ≤ γ emp for any classifier f (x) at a confidence level η given by:

III. RESULTS
To estimate the performance of the proposed method, together with the parameter fitting, two strategies were employed in this paper. A 10-fold cross validation strategy and the re-substitution estimation of the actual error on the training set. Once the parameters were optimized, the test set was used to estimate the accuracy, recall and F1-score.
Regarding the parameters of (1) and the number of PLS components, a grid search strategy was employed. Fig. 5 shows the accuracy results on the training set for each pair (number of PLS components, number of features), where the number of features is selected by order from the pool of S R , affected by the value of n. It can be claimed that a wide range of values around 10 PLS components and 10 selected features produce competitive classification results, whereas a choice of PLS components above 3 and below 20 is also a good compromise independently from the number of features selected. This can be related to the robustness of the method.
The grid search results indicate that a reduced number of selected components, around 23, is optimum for classification results, in combination with a reduced number of PLS components (around 13). The regions that were selected are depicted in Fig. 6, excluding the MMSE score and age, also selected. This optimum set of features was processed as described in section II-B2, providing the training and test data for classification. Table 4 summarizes the classification results on the training and testing sets for the different classifiers. The results are also summarized for the test set excluding the dummy subjects. SVM outperforms every other classifier, and linear kernel provides slightly better performance than non-linear kernels. However, even the simplest 1-NN classifier provides very competitive results if related to the challenge results. Challenge results are summarized in tables 5 and 6. The competitive performance of every classifier is a sign of the preprocessing importance, revealing that the feature selection and extraction provide a very relevant set of features. Furthermore, the fact that more complex techniques are the ones with the lowest performance, such as CNN and deep learning algorithms in general, is consistent. It is mainly due to the use of such a low number of subjects and features, since these techniques are especially focused on problems with a large and high-dimensional dataset.    Table 3 summarizes the linear SVM classification results obtained following the proposed aggregation method. F1-score and recall are also reported during the training and test phases. The results are detailed for each class: HC, MCI, cMCI and AD. As expected, AD and cMCI are the classes with highest recognition values, whereas MCI report recognition rates slightly over random classification during training, but improved values on test. Overall, recognition rates are several percentage points over the challenge winner approach, outperforming every proposal of the challenge in the partial ranking (Table 5) and in the final ranking (Table 6).
A study about the control of the family-wise error (FWE) rate in our CAD system was performed based on the re-substitution estimation. A dataset of HCs containing 100 samples, 60 from the training set and 40 from the test set (without dummies), was used. The dataset was randomly divided into two subsets of 50 subjects each throughout 1000 iterations. Then, the re-substitution estimation was evaluated under the null hypothesis that the actual risk was equal to 0.50 (no group difference in the feature set should be true), where the number of PLS components (dimensions) was chosen equal to 1. The re-substitution accuracy obtained was equal to 0.612 with a standard deviation of 0.037. The upper bound associated to this configuration is equal to 0.136, with a significance level η = 0.05 as shown in equation 5, thus the actual risk is then at most 0.523 ± 0.037. As a conclusion, we cannot reject the null-hypothesis in the test.
On the other hand, If a 10-fold cross validation strategy is tested instead of re-substitution, an accuracy of 0.583 ± 0.06 is obtained with 13 PLS components. Although we can reject the null hypothesis with the current test, it is possible to VOLUME 8, 2020 FIGURE 6. Selected regions after one vs. one t-test feature selection. not rejecting the null hypothesis with a confidence interval of 0.10, a value higher than the usual one of 0.05, but interesting in the neuroimaging field [12]. The last verification of the significance of the selected features is assessed using the actual risk [18]. Following (5), we calculated the upper bound considering the final 13-dimensional dataset in each one-vs-one classification. The sample size in each comparison is 200 that is, by combining both training and test (without dummies) sets, a total of 200 subjects are considered in a two-class analysis.
With η = 0.05, the upper bound is equal to 0.343. Table 7 shows empirical errors and actual risks of each one-vs-one comparison according to the upper bound. HCvsMCI and MCIvscMCI actual risks are above 0.50 at the worst case, VOLUME 8, 2020 TABLE 7. Actual risk associated to each one-vs-one classifier using the selected and extracted features (13) and SVM by resubstitution (200 samples).
which means that the selected features cannot be accepted as significant to classify these conditions at the given significance level. Nevertheless, the difficulty of separating these conditions is well-known, thus in general terms a high relevance of the selected features is observed.
An additional study was conducted to detect the relationship between the actual error and dimensionality used in each classifier. Fig.7 shows that the actual risk is never less than 0.50 in the HCvsMCI comparison, whilst in the MCIvscMCI classification the number of PLS components needed for achieving that condition is less or equal to 6. Nevertheless, the use of 6 PLS components would only decrease the overall accuracy down to 60%.

IV. DISCUSSION
The results presented reveal the importance of feature selection and extraction in the CAD pipeline of this challenge problem. Concretely, the default sorting of the 429 cortical thickness values provided by the organizers of the challenge seem to contain already important information for classification. The best results are obtained by removing some of the original set of default sorted features, by applying the proposed minimum filter. After the feature selection by means of the filter in (1), it is relevant to emphasize that there is not an equilibrium between right and left hemisphere regions. Specifically, there is a dominance of left-sided hemisphere regions, which is coherent with recent findings in CAD diagnosis of AD [29]. If compared with other competent CAD proposals of the challenge, such as the winner Dimitriadis-Liparas (DL) proposal [10], there is a significant overlap with the feature extraction results. The DL approach also results in a left-hemisphere regions predominance. Therefore, a successful feature selection and extraction method is critical for optimal performance.
The use of t-tests as sorting criteria for filtering features and the upper-bound tests for assessing the feature relevance are justified on the basis of the Kolmogorov-Smirnov (KS) and the upper bound tests [18]. Moreover, the KS test quantifies the departure of the empirical distribution function of the features from a cumulative distribution function of a particular statistical distribution. In this case, the assumption underlying a t-test implies that the feature values follow a normal distribution, which is an acceptable assumption in the light of the results of the KS test presented in Fig. 8. It is important to stress that direct comparisons between different t i values at different tests are never performed, but the values of t i are used for feature sorting. In addition, we employed a novel approach [18] for testing relevance in a set of features based on a data-driven approach (agnostic or free-parameter model). The latter is based on the re-substitution error estimate and the theoretical upper-bound of the empirical errors that provide a confidence interval for performing hypothesis testing.
The present methodology can be applied in other multiclass classification problems, in which there is a hierarchy and overlap between classes. Furthermore, the computation of the final selected algorithm is fast, which is an advantage over tested deep learning techniques, which require a longer processing time associated with network training.
Concerning the limitations on the present work, the preprocessing of outliers reveal a high redundancy on the original data. Therefore, the pre-selection of brain regions for the challenge, and the extraction of cortical thickness affects the maximum achievable performance in several ways. Firstly, the limited number of training samples makes statistical estimations prone to bias, a widely known-problem in medical imaging [4], [22]. The cross validation technique used in this work for performance estimations can be considered as ''pessimistic'' [27], and therefore some mismatches between training fitting and final test estimations can be expected, limiting the capabilities of the system for reaching its highest performance at test level.
Even though the proposed CAD was evaluated using the test set labels, which were not available during the challenge competition, the robustness of the method would have led to the best competition results with just a few submissions. Table 4 and Fig. 5 illustrate how the method is robust against small variations on the optimal parameters and classifier choice, providing with accuracy values over 60% for a wide range of combinations.

V. CONCLUSION
Using the available data for A Machine learning neuroimaging challenge for automated diagnosis of Mild Cognitive Impairment, we developed a post-competition method for multiclass classification. The method is based on a one vs. one approach for feature selection, PLS feature extraction and classification. The presented methodology is capable of identifying the most relevant features for a multiclass classification by a sorting-and-filtering method, and is evaluated using different parameters and classifiers. The results are robust against variations of parameters and classifiers, and they outperform all the proposals submitted to the challenge by more than 5 percentage points in accuracy. The method is also coherent with recent findings in CAD of AD, and can be applied to other multiclass classification problems.  DIEGO CASTILLO-BARNES received the degree in telecommunications engineering, in 2012. He is currently pursuing the Ph.D. degree with the University of Granada. Before his Ph.D., he works at the private sector. He has authored several articles in high-impact journals in signal processing in medical imaging, computer aided diagnosis systems, machine learning, and image processing fields. He has served as a reviewer for several international journals and even promoted to review-editor for Frontiers.