Automated Detection of Alzheimer’s Disease and Mild Cognitive Impairment Using Whole Brain MRI

Early diagnosis is critical for the development and success of interventions, and neuroimaging is one of the most promising areas for early detection of Alzheimer’s disease (AD). This study aims to develop a deep learning method that extracts useful AD biomarkers from structural magnetic resonance imaging (sMRI) and classifies brain images into AD, mild cognitive impairment (MCI), and cognitively normal (CN) groups. We adapted and trained convolutional neural networks (CNNs) on sMRI brain images from the publicly available ADNI datasets. The proposed mechanism combines features from different layers to hierarchically transform the MR images into more compact high-level features. The proposed method uses fewer parameters, which reduces the computational complexity. Compared with existing state-of-the-art methods for AD classification, the proposed method shows superior results on widely used evaluation metrics, including accuracy and area under the ROC curve, suggesting that the proposed convolution operation is suitable for AD diagnosis.


I. INTRODUCTION
Alzheimer's disease (AD), a neurological degenerative disease, is a prominent challenge of the 21st century. According to the World Alzheimer Report, more than 55 million people have been diagnosed with Alzheimer's disease, and this number is increasing day by day, with 78 million expected by 2030 [1]. Clinically, the disease manifests as memory loss, disorientation, and visuospatial disturbances [2], and many efforts have therefore been made toward early detection and diagnosis of AD. Partial treatment of AD can be done by observing the symptoms; however, a method to identify specific biomarkers in the cerebrospinal fluid (CSF) is being developed for a more accurate diagnosis [3]. This is an intrusive investigation that may cause harm to the patient [4]. In addition, advanced imaging techniques such as magnetic resonance imaging (MRI) and positron emission tomography (PET) can also be used to identify AD-related structural and molecular biomarkers [5]. In particular, MRI is a non-invasive and powerful tool for understanding and evaluating the anatomical and functional brain changes associated with AD. MRI scans are recognized as indispensable in clinical practice and play an important role in assessing AD progression [6]-[8].
The integration of large-scale, high-dimensional, and multimodal data from rapidly advancing neuroimaging techniques makes it difficult for contemporary methods to identify the disease. As a result, interest in computational machine learning methods for integrative analysis has exploded [9]-[12]. These methods are used to generate a desired output from a set of training data, such as voxel intensity, tissue density, and shape descriptor features. However, using these machine learning algorithms for disease classification requires pre-processing techniques that are mostly time-consuming and computationally intensive. Classification studies using machine learning algorithms usually require four steps: feature extraction, feature selection, dimension reduction, and feature-based classification algorithm selection [13], [14]. These steps require specialized knowledge and multiple optimization steps, which are time-consuming and laborious. Researchers have therefore been exploring an alternative to traditional machine learning: deep learning (DL) algorithms. Deep learning has emerged as one of the most widely adopted families of machine learning algorithms and has shown strong results in a variety of domains, including speech recognition, computer vision, and natural language understanding [15]-[18]. In general, deep learning methods are a subset of representation-learning methods: they can automatically identify the best representation from raw data without the need for prior feature selection [19]. This is accomplished by employing a hierarchical structure with varying levels of complexity and by applying consecutive nonlinear transformations to the raw data. These transformations produce increasingly higher levels of abstraction, with higher-level features being less sensitive to noise in the input data than lower-level features [20].
Consequently, researchers are focusing their efforts on developing deep learning models for medical imaging that can accurately diagnose disease. Recently, deep learning models have shown significant success in a variety of medical image analysis problems, including CT scans, MRIs, X-rays, ultrasounds, and sentiment analysis [21]. They have produced notable results in the detection and classification of specific diseases of the lungs, abdomen, brain, cardiovascular system, retina, and other regions [22].
Notably, the most widely used deep learning design, the convolutional neural network (CNN), has received a lot of attention in medical image analysis due to its success in image analysis and classification [23], [24]. These accomplishments have motivated researchers to improve CNN-based systems for AD detection. Although existing methods have provided good diagnostic results, little has been done to optimize the CNN architecture for practical AD detection. Inspired by the success of deep learning methods and patch-based mechanisms in medical imaging, in this work we propose an improved convolutional neural network for AD diagnosis and prediction using magnetic resonance images. The proposed architecture significantly reduces the number of parameters and the computational cost compared to conventional neural networks. Our research contributions are fourfold:
1) This work examines the computational cost of the convolutional layer and concentrates on alternative ways of performing the convolution operation in order to reduce the number of parameters and the computational cost while maintaining classification accuracy.
2) Each layer is composed of a block containing a series of operations on the input. The resolution and size of the input to each operation in the block are kept the same. The usefulness of this design is demonstrated by the improved classification performance of the proposed method.
3) The proposed model learns useful features from the input data without pre-processing and shows superior performance.
4) The performance of the proposed method is compared with two state-of-the-art classification models, ResNet50 and VGG; the proposed method achieves better accuracy with fewer parameters.
The rest of the paper is organized as follows: Section II reviews the relevant literature, and Section III summarizes the materials and methodology used in this article. Sections IV and V present the results of the experiments and their interpretation. Section VI concludes with an outlook on future projects.

II. RELATED WORK
Several approaches to neuroimaging classification have been proposed in recent years to improve classification performance. We examine some machine learning (ML) classification structures used in neuroimaging, as well as methods based on convolutional neural networks.

A. MACHINE LEARNING FOR NEUROIMAGING
In recent years, many machine learning based techniques have been used for multi-class and binary classification in the early detection of Alzheimer's disease. Kim et al. [25] proposed a fully automated classification method that uses cortical thickness features for diagnosis. Long et al. [26] investigated regional morphological differences of the brain and observed that deformation in the hippocampus and amygdala identified progressive MCI, whereas diffuse morphological changes in the whole-brain gray matter (GM) identified mild or moderate AD. The subjects were subsequently classified using a linear support vector machine (SVM). Guo et al. [27] proposed extracting and combining the brain region and subgraph features of functional magnetic resonance imaging, which were then used to train a multikernel SVM for classification. This approach retains not only the global topological information but also the sensitivity to changes in individual brain regions. In contrast to Guo et al., Khedher et al. [28] proposed a method for classifying AD that uses independent component analysis to extract features from white matter (WM), GM, and cerebrospinal fluid (CSF) regions for training an SVM classifier. Moreover, Tong et al. [29] proposed multiple instance learning for dementia classification, where features were formed by extracting pockets of MRI voxel patches and mapping them to graphs. An SVM classifier was then used to distinguish between AD patients and normal control (NC) subjects. In another work, Gupta et al. [30] used machine learning approaches (SVM, k-nearest neighbor (KNN), and random forest (RF)) to classify atrophic states (AD, NC/healthy control (HC), asymptomatic Alzheimer's disease (aAD), and mild Alzheimer's disease (mAD)) using combined voxel-based morphometry (VBM) features, cortical and subcortical volumetric (CSC) features, and hippocampal volumetric features.

B. CNN METHODS FOR NEUROIMAGING
Deep learning has shown enormous potential in medical image diagnosis [31]-[33], where it was initially used for region segmentation or feature extraction, followed by traditional machine learning algorithms such as SVM and boosting. For instance, Silva et al. [34] proposed a convolutional neural network for feature extraction from MRI scans, followed by SVM, KNN, and random forest algorithms for Alzheimer's disease classification. Similarly, Liu et al. [35] proposed a deep convolutional feature learning method for the classification of AD and MCI using both unsupervised and supervised learning. Owing to the recent success of deep learning, especially convolutional neural networks, in extracting image features in computer vision, its potential has been explored for AD diagnosis [36]. For instance, Wang et al. [37] used brain extraction tools to select the hippocampus-containing slices and fed them to a convolutional neural network for diagnosis. In addition, a patch-based ensemble classifier was created to predict the AD and NC classes [38]. Furthermore, 2D and 3D deep learning models for the diagnosis of Alzheimer's disease were developed in [39]. Unlike the others, Basaia et al. [40] additionally augmented the training data by deforming, cropping, rotating, flipping, and scaling the MRI data at different angles before feeding it into a convolutional neural network for diagnosis.
In a more recent work, Korolev et al. [41] proposed two 3D CNN architectures based on VGGNet and ResNet, showing that manual feature extraction is unnecessary for brain MRI classification. Their 3D models, 3D-VGG and 3D-ResNet, are widely used in research for the classification of 3D medical images. Hosseini-Asl et al. [7], [42] also focused on 3D CNNs to classify Alzheimer's Disease Neuroimaging Initiative (ADNI) data: a 3D convolutional neural network was used to extract features from MRIs and to identify biomarkers for multiple classes of AD. Similarly, Abrol et al. [43] created 3D CNNs based on the ResNet architecture and tested them on a variety of binary and multiclass tasks. They created a training set for cross-validation and a small test set using ADNI data. The results were promising; however, no comparisons were made with other standard evaluation frameworks, which leaves open the possibility that the model overfitted the training examples. In [44], a deep learning based classifier for AD versus NC classification was proposed, in which a discrete volume estimation model with convolutional neural networks was used to extract deep features from the discrete volumes of the left and right hippocampal models (LHM and RHM). Recently, the authors of [2] proposed an AD diagnostic model based on a densely connected 3D CNN and an attention-driven mechanism to combine high-level features and spatial information extracted from MRI. J. Liu et al. [45] proposed replacing conventional convolution with depthwise separable convolution. They trained their idea using AlexNet and GoogLeNet transfer learning models, which significantly reduced the computational cost and the number of parameters.
In this paper, we propose a CNN structure in which a standard convolutional layer is embedded along with depthwise and pointwise convolutional layers to extract more features that are sensitive to brain activities or structural differences in large regions. To the best of our knowledge, such an architecture has not previously been used for brain image classification. It greatly reduces the number of parameters while maintaining reasonable accuracy compared to other benchmark models.

A. MRI ACQUISITION PROTOCOL
The ADNI was established in 2003 as a public-private partnership led by Michael W. Weiner, MD. The primary goal of ADNI was to determine whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment could be used together to track the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). T1-weighted MR images were acquired sagittally in ADNI using a volumetric 3D MPRAGE sequence with an in-plane spatial resolution of 1.25 × 1.25 mm² and 1.2 mm thick sagittal slices. The majority of these images were captured using 1.5 T scanners. More details about the MR acquisition can be found on the ADNI website (http://adni.loni.usc.edu/).

B. PARTICIPANTS
In this study, we evaluated T1-weighted MRI data of ADNI1/GO participants, including 163 AD patients, 163 subjects diagnosed with MCI, and 163 normal controls (NC), within 24 months. Table 1 shows the demographic details for each group. The structural MRI scans used in the study had already been reviewed for quality and had undergone gradient inhomogeneity correction (gradwarp), B1 non-uniformity correction, and N3 processing (to reduce residual intensity non-uniformity).

C. IMAGE PATCH GENERATION
Due to the volumetric nature of MRI images, the natural choice of deep learning model is a 3D convolutional neural network (3D CNN) [46]. Compared to 2D CNN models, however, 3D CNN models are computationally intensive and time-consuming to train due to their high-dimensional input. Another problem is that most current medical datasets are quite small, and this scarcity of data makes it difficult to train a deep network that generalizes well. Therefore, in our study, we used a 2D CNN model. The 489 3D MRI scans, of dimension 192 × 192 × 160, cannot be fed directly into a 2D CNN model. First, the 3D MRI scans were downsized to 96 × 96 × 96. The volumes were then divided into axial, coronal, and sagittal slices. The slices at the beginning and at the end were discarded since they did not contain any useful information. The remaining slices were normalized to zero mean and unit standard deviation. Randomly selected axial, coronal, and sagittal slices were then used to train the 2D CNN model. Some examples of slices from MRI scans of CN, MCI, and AD patients are shown in Figure 1.
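The patch-generation steps described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the nearest-neighbour resize stands in for whatever resampler was actually used, the `margin` of 20 discarded slices at each end is a hypothetical value, the mapping of array axes to anatomical planes depends on the scan orientation, and the function names are invented for this example. The figure of 40 slices per plane is taken from the implementation details of this paper.

```python
import numpy as np

def downsample_volume(volume, target=(96, 96, 96)):
    """Nearest-neighbour resize of a 3D volume (assumed stand-in for a real resampler)."""
    idx = [np.linspace(0, s - 1, t).round().astype(int)
           for s, t in zip(volume.shape, target)]
    return volume[np.ix_(*idx)]

def extract_slices(volume, n_per_plane=40, margin=20, rng=None):
    """Randomly pick slices along each of the three axes.

    `margin` slices at the beginning and end of every axis are skipped,
    mirroring the removal of uninformative border slices (value is hypothetical).
    """
    rng = np.random.default_rng() if rng is None else rng
    slices = []
    for axis in range(3):
        candidates = np.arange(margin, volume.shape[axis] - margin)
        chosen = rng.choice(candidates, size=n_per_plane, replace=False)
        for i in chosen:
            slices.append(np.take(volume, i, axis=axis))
    return np.stack(slices)

def normalize(slices, eps=1e-8):
    """Normalize every slice to zero mean and unit standard deviation."""
    mean = slices.mean(axis=(1, 2), keepdims=True)
    std = slices.std(axis=(1, 2), keepdims=True)
    return (slices - mean) / (std + eps)
```

With these defaults, one 192 × 192 × 160 volume yields an array of shape (120, 96, 96), that is, 40 normalized slices for each of the three planes.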

D. NETWORK ARCHITECTURE
The CNN architecture used in this study was inspired by the human visual cortex. The human eye receives information in its receptive field, which is similar to the convolution operation that slides over the input image and creates the feature map by working on one receptive field of the input at a time. A convolutional network involves multiple layers, including convolutional layers with ReLU activation functions, max-pooling layers, and fully connected layers. Each input is processed through these operations to produce a final output in the form of a binary or multiclass classification. The convolution operation relies on a set of interconnected neurons, shared hyperparameters, local connectivity, and shift invariance to increase the performance of the network. Based on this motivation, we propose an end-to-end deep CNN architecture for multi-label AD biomarker identification using the entire image volume as input. Furthermore, for comparison with the proposed method, we also use the popular CNN architectures VGG-Net and ResNet, which have been widely applied to classification tasks. These models have been used successfully in a variety of applications, including image classification, recognition, image labelling, and pose detection [47]. In this section, we discuss in detail our proposed CNN architecture as well as the related methods used in this work.

1) VGG-Net Model
The VGG network is the pre-trained CNN model proposed by Simonyan and Zisserman [48] of the Visual Geometry Group (VGG) at Oxford University in 2014. VGG was trained on the ImageNet ILSVRC dataset, which consists of images belonging to 1000 classes, with 1.3 million images used for training and 50,000 for validation. VGG-19, a variant of the VGG architectures, has 19 weight layers and has consistently performed better than other state-of-the-art models. The model consists of stacked convolutional layers and fully connected layers, which allows better feature extraction, and uses max-pooling (instead of average-pooling) for down-sampling before classification with the SoftMax activation function. The VGG-19 model is used as a baseline model in this study and the results are improved with the ADNI dataset to classify multiple stages of Alzheimer's disease. The architecture of VGG-19 is shown in Figure 2.

2) ResNet Model
The Residual Network [50] won first place in classification, localization, and detection at ILSVRC-2015. Its authors wanted to explore whether learning better simply meant stacking more layers on top of each other in the network. They discovered the degradation problem: beyond a certain number of layers, traditional models similar to VGG do not improve their performance but instead get worse. To solve this problem, they proposed the residual function, which is the basic building block of a residual network (ResNet). In this work, we directly adopted ResNet from the non-bottleneck 50-layer architecture, where links with increasing dimension were either (A) identity links, i.e., zero padding, or (B) projection links, i.e., convolutions with 1 × 1 filter (kernel) size, and used it as the basic model to classify AD using the ADNI dataset. Figure 3 shows the basic architecture of ResNet-50 (only 34 layers are shown for simplicity).

3) Proposed Model
Convolutional layers are the fundamental building blocks of any deep CNN that achieves the best performance by using complex activation functions. To diagnose Alzheimer's disease, the proposed model uses a deep convolutional neural network to automatically extract features from whole-brain MRI scans. Figure 4 shows the proposed pipeline, which consists of three main steps: brain volume resizing, 3D volume slicing, and CNN processing. Inspired by the architectural patterns of ResNet and ConvMixer [51], we propose a simple yet effective convolutional method that simultaneously performs standard convolution, depth-wise convolution, and point-wise convolution, followed by a skip convolution layer, to learn multi-level features from MRI scans of the brain.
For the standard convolution operation, $D_F$ denotes the spatial width and height of the square input feature map, $M$ denotes the number of input feature map channels, and $N$ denotes the number of output feature map channels. Features are extracted by the $D_k \times D_k$ convolution kernel of the standard convolutional layer, where $D_k$ denotes the spatial width and height of the kernel. The standard convolution from the input feature map $I$ to the output feature map $G$ is given by:
$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot I_{k+i-1,\,l+j-1,\,m} \qquad (1)$$
where $I$ denotes the input feature maps, $G$ the output feature maps, and $K$ the convolution kernels. The indices $i$ and $j$ give the position of an element within the convolution kernel, $k$ and $l$ give the position of an element in the input and output feature maps, $m$ is the channel of the input feature map, and $n$ is the channel of the output feature map.
The number of parameters of the standard convolution is calculated as follows:
$$F_{\mathrm{std}} = D_k \cdot D_k \cdot M \cdot N \qquad (2)$$
The computational cost of the standard convolution is:
$$G_{\mathrm{std}} = D_k \cdot D_k \cdot M \cdot N \cdot D_F \cdot D_F \qquad (3)$$
where $F_{\mathrm{std}}$ is the total number of parameters, $G_{\mathrm{std}}$ is the computational cost, $M$ is the number of input feature map channels, $N$ is the number of output feature map channels, $D_F$ is the spatial width and height of the square input feature map, and $D_k$ is the spatial width and height of the convolution kernel.
b: Depth-wise and Point-wise Convolution Operation
Depth-wise convolution is a convolution operation performed separately on each channel of the input, and is used to extract the spatial features of each channel. Point-wise convolution is a standard convolution with a 1 × 1 kernel applied to the feature map produced by the depth-wise convolution. Figure 6 shows the depth-wise and point-wise structure. The input has size $D_F \times D_F \times M$, where $D_F$ is the height and width of the input and $M$ is its number of channels. The feature map produced by the depth-wise convolution has size $D_g \times D_g \times M$ ($D_g$ is the height and width of the output), so its number of channels equals that of the input; it is used as the input to the next convolution. The point-wise convolution kernels have size 1 × 1, and the number of channels of each kernel must equal the number of channels of its input feature map. When the number of convolution kernels is $N$, the output feature map after the point-wise convolution has size $D_g \times D_g \times N$.
The output feature map of the depth-wise convolution is written as:
$$\bar{G}_{k,l,m} = \sum_{i,j} K_{i,j,m} \cdot I_{k+i-1,\,l+j-1,\,m} \qquad (4)$$
where $I$ denotes the input feature maps, $\bar{G}$ denotes the output feature maps, and $K$ denotes the convolution kernels. The position of an element within the convolution kernel is given by $i$ and $j$. The values $k$ and $l$ give the position of an element in the input and output feature maps, while $m$ is the input feature map channel.
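To make equation (4) concrete, the following NumPy sketch implements a depth-wise convolution with explicit loops, followed by a 1 × 1 point-wise convolution. It is an illustrative reference implementation only: it uses "valid" padding and stride 1, ignores bias terms, uses 0-based indices (so the $k+i-1$ of the equation becomes `k + i`), and the function names are chosen for this example rather than taken from the paper.

```python
import numpy as np

def depthwise_conv(I, K):
    """Depth-wise convolution: channel m of I is filtered only by K[:, :, m].

    I: input feature map of shape (H, W, M)
    K: kernels of shape (Dk, Dk, M), one kernel per input channel
    Returns an array of shape (H - Dk + 1, W - Dk + 1, M).
    """
    H, W, M = I.shape
    Dk = K.shape[0]
    out = np.zeros((H - Dk + 1, W - Dk + 1, M))
    for k in range(out.shape[0]):
        for l in range(out.shape[1]):
            patch = I[k:k + Dk, l:l + Dk, :]          # (Dk, Dk, M) window
            out[k, l, :] = (patch * K).sum(axis=(0, 1))  # sum over i, j only
    return out

def pointwise_conv(G, P):
    """1 x 1 convolution: mixes the M channels of G into N output channels.

    G: feature map of shape (H, W, M); P: weights of shape (M, N).
    """
    return G @ P
```

Because the sum in `depthwise_conv` runs only over the kernel positions and never over channels, the channel count is preserved; the channel mixing happens entirely in `pointwise_conv`.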
The number of parameters and the computational cost of the depth-wise convolution are given by:
$$F_{\mathrm{dw}} = D_k \cdot D_k \cdot M \qquad (5)$$
and
$$G_{\mathrm{dw}} = D_k \cdot D_k \cdot M \cdot D_F \cdot D_F \qquad (6)$$
The number of parameters depends only on the number of input feature map channels and the size of the convolution kernel. The computational cost depends on the number of input feature map channels, the size of the convolution kernel, and the square of the input feature map size. Neither quantity depends on the number of output feature map channels $N$. Compared with equations (2) and (3), equations (5) and (6) clearly show the simplicity of depth-wise convolution. However, unlike a traditional convolutional layer, depth-wise convolution only filters the input channels and does not combine them to create new features. Therefore, to generate new features, an additional layer of 1 × 1 standard convolution is required [54]. The new feature map is generated by combining the depth-wise convolution with the 1 × 1 point-wise convolution. The resulting number of parameters and computational cost are given in equations (7) and (8), respectively:
$$F_{\mathrm{dws}} = D_k \cdot D_k \cdot M + M \cdot N \qquad (7)$$
and
$$G_{\mathrm{dws}} = D_k \cdot D_k \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F \qquad (8)$$
FIGURE 4. The layout of the proposed model. The first part is based on slice selection, whereas the second part introduces an isotropic architecture that allows the network to mix multiple feature maps that are more useful for classification.
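The savings implied by the parameter and cost formulas can be checked with simple arithmetic. The sketch below evaluates the standard and depth-wise separable counts for one layer, using the 5 × 5 kernel and 256 filters reported for the proposed model; the feature map size of 96 is an assumption made here for illustration (it matches the input slice size, but the true size at each layer depends on the architecture table), and bias terms are ignored.

```python
def standard_conv(dk, m, n, df):
    """Parameters and multiply-accumulate cost of a standard convolution."""
    params = dk * dk * m * n
    cost = params * df * df
    return params, cost

def depthwise_separable_conv(dk, m, n, df):
    """Parameters and cost of a depth-wise convolution plus a 1 x 1 point-wise one."""
    params = dk * dk * m + m * n
    cost = params * df * df
    return params, cost

# dk and m = n follow the paper's stated 5 x 5 filters and 256 channels;
# df = 96 is an assumed feature map size, not a value given for every layer.
dk, m, n, df = 5, 256, 256, 96
std_params, std_cost = standard_conv(dk, m, n, df)
dws_params, dws_cost = depthwise_separable_conv(dk, m, n, df)

print(f"standard params:            {std_params}")
print(f"depth-wise separable params: {dws_params}")
print(f"reduction ratio:             {dws_params / std_params:.4f}")
```

Under these assumptions the ratio of the two parameter counts equals $1/N + 1/D_k^2$, which is also the ratio of the two computational costs, since both costs scale by the same factor $D_F \cdot D_F$.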

E. IMPLEMENTATION DETAILS
In this work, we built our 2D model using 3D structural MRI scans of 489 subjects (163 AD, 163 MCI, and 163 CN). The original MR images were resampled to a resolution of 96 × 96 × 96. A total of 40 brain slices were generated from each plane (axial, coronal, and sagittal) of each subject for training and testing, resulting in 58,680 slices, of which 19,560 belonged to the AD group, 19,560 to the MCI group, and 19,560 to the NC group. During the experiments, each group was randomly divided into three parts: training (60% of the subjects), validation (10% of the subjects), and testing (30% of the subjects). The proposed classification method was implemented in Python 3.8.10 using the Keras library based on TensorFlow 2.5.0, and run on a computer with an NVIDIA RTX 3090 GPU in an Ubuntu 20.04 x64 environment. The network parameters were initialized randomly, and the Adaptive Moment Estimation (Adam) optimizer was used with an initial learning rate of 0.001 and a decay rate of 0.9. The batch size was set to 32, and a dropout layer was added to avoid overfitting. We used the proposed algorithm to discriminate between AD, cognitively normal (CN), and mild cognitive impairment (MCI) subjects in a multiclass classification task, and to perform binary classification between AD and NC, AD and MCI, and CN and MCI.
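The dataset bookkeeping described above can be sketched as follows. This is a minimal illustration that assumes the split is made per subject, so that all slices of one person land in the same partition; the rounding of the fractional subject counts and the subject identifiers are assumptions, since neither is specified in the text.

```python
import random

SLICES_PER_SUBJECT = 40 * 3  # 40 slices from each of the three planes

def split_subjects(subject_ids, train=0.6, val=0.1, seed=0):
    """Randomly split subject IDs into train/validation/test partitions.

    The remainder after the train and validation fractions goes to the
    test partition. Rounding behaviour is an assumption of this sketch.
    """
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train * len(ids))
    n_val = round(val * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Hypothetical identifiers for the 163 subjects of one diagnostic group.
ad_subjects = [f"AD_{i:03d}" for i in range(163)]
train_ids, val_ids, test_ids = split_subjects(ad_subjects)

print(len(train_ids), len(val_ids), len(test_ids))
print("slices per group:", 163 * SLICES_PER_SUBJECT)
print("slices in total: ", 489 * SLICES_PER_SUBJECT)
```

Splitting by subject rather than by slice avoids placing slices of the same brain in both the training and the test partitions.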

A. PERFORMANCE EVALUATION
In the diagnostic tasks, the predicted results are categorized as True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). TP refers to a positive sample correctly predicted as positive. TN refers to a negative sample correctly predicted as negative. FP refers to a negative sample falsely classified as positive. FN refers to a positive sample falsely predicted as negative. To evaluate our diagnostic model, we use the following widely used indicators: accuracy, specificity, sensitivity, precision, F1 score, and the receiver operating characteristic (ROC) curve. As shown in equation (9), accuracy is the proportion of correctly diagnosed samples among all test samples.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (9)$$
As shown in equation (10), specificity represents the percentage of negative samples that are correctly diagnosed.
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (10)$$
As shown in equation (11), sensitivity represents the ability of a model to identify AD patients among all positive samples.
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (11)$$
Precision, in equation (12), is defined as the percentage of correctly predicted positive observations out of all predicted positive observations.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (12)$$
The F1 score is calculated as the harmonic mean of precision and recall, as shown in equation (13).
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (13)$$
An ROC curve is a graph that depicts a classification model's performance across all classification thresholds. It plots two parameters. The first is the True Positive Rate (TPR), also referred to as recall, defined in equation (14).
$$TPR = \frac{TP}{TP + FN} \qquad (14)$$
The second is the False Positive Rate (FPR), defined in equation (15).
$$FPR = \frac{FP}{FP + TN} \qquad (15)$$
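The evaluation measures above follow directly from the four confusion-matrix counts. The short sketch below computes them for one binary task; the function name and the example counts are illustrative and are not results from this study, and the function assumes that no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the standard binary classification metrics from raw counts."""
    sensitivity = tp / (tp + fn)   # also the recall and the true positive rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp),
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "tpr": sensitivity,
        "fpr": fp / (fp + tn),
    }

# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For a multiclass task such as AD versus MCI versus CN, the same function can be applied once per class in a one-versus-rest fashion, treating that class as positive and the other two as negative.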

B. COMPARISON WITH BASELINE MODELS
The performance of the proposed method was evaluated using the ADNI database. The proposed method performed well in the multi-class classification task (AD versus MCI versus CN), with a highest accuracy of 96.41%. A comparison of the proposed work with other models trained on the same dataset is presented in Table 2. The comparison shows that the proposed model outperforms the benchmark models in terms of accuracy despite having fewer parameters.
ResNet and VGG-Net were selected as the benchmark models for comparison. Using common indicators of classification performance such as sensitivity (SEN), specificity (SPE), precision, F1 score, and accuracy (ACC), the results show that ResNet achieves an accuracy of about 95.34%, which is 2.08% higher than VGG (93.26%) due to feature propagation enhancement and skip connections, but lower than the proposed work. Our model achieved the highest accuracy of 96.41%, which is 1.07% higher than ResNet and 3.15% higher than the VGG network. Our model also outperformed the baseline models on most other performance metrics. For example, it achieved a specificity of 97.73% and a precision of 95.50%, both of which are higher than those of the baseline models. With a sensitivity and F1 score of 95.31% and 95.32%, respectively, ResNet was marginally better. Figure 7(a) shows the corresponding ROC curve for the ADNI dataset, and Figure 7(b) shows the corresponding confusion matrix. Figures 7(c) and 7(d) depict the accuracy and loss curves, respectively, for both the training and validation datasets. We have demonstrated the effectiveness of the CNN model developed in this study for both binary and multiclass classification. Training for binary classification was performed for three different scenarios: AD versus CN, AD versus MCI, and MCI versus CN.

C. COMPARISON WITH OTHER EXISTING METHODS
Using the ADNI database, we compared the classification results of our model with those from previous studies (see Table 3). We started by comparing our proposed model with traditional machine learning methods. Liu et al. [55] proposed an ROI-based method, named the whole brain hierarchical network, to extract brain features, which were then classified using a machine learning method, the multiple kernel boosting (MKBoost) algorithm. It achieved accuracies of 94.65% for AD vs CN, 89.63% for AD vs MCI, and 85.79% for MCI vs CN classification using a single structural MRI modality dataset. Sun et al. [56] achieved similar results. In their study, they proposed a new SVM-based learning method to extract spatial-anatomical information and introduced a group lasso penalty to induce structural sparsity. Their method achieved accuracies of 95.1% for AD vs CN, 70.8% for MCI vs CN, and 65.7% for AD vs MCI classification. In addition, we compared our model with existing deep learning methods. Hosseini et al. [7] proposed a 3D convolutional auto-encoder-based method. To capture anatomical shape variations in structural MRI scans of the brain, they used a pre-trained model. They then tested their model on the CADDementia MRI dataset without skull stripping as a preparatory step and achieved 89.1% accuracy in multi-class classification. Furthermore, on binary classification tasks, their model achieved 97.6% for AD vs NC, 95% for AD vs MCI, and 90.8% for MCI vs NC. Basaia et al. [40] proposed a deep learning algorithm for predicting individual Alzheimer's disease diagnosis based on a structural cross-sectional MRI scan. Their model was 99.2% accurate for AD vs CN, 87.1% accurate for MCI vs CN, and 75.4% accurate for AD vs MCI. Liu.J et al. [57] used the OASIS dataset to construct a CNN-based architecture that achieved 78.02% accuracy for multiclass classification, 84.65% for MCI versus CN, 72.96% for AD versus MCI, and 75.2% accuracy for MCI versus CN classification using the ADNI data set. Later, in the same paper, they improved their work to reduce the number of parameters using a deep separable convolution model and achieved 77.79% accuracy while reducing the parameters of the CNN model by 87.94%. To learn features from the segmented hippocampal region, Liu.M et al. [57] developed an architecture combining 3D densely connected convolutional networks (3D DenseNet) and a multi-task CNN.
FIGURE 7. (c) Accuracy plot for training and validation data. (d) Loss graph for training and validation data.
The authors of [59] proposed a transfer learning scheme based on convolutional neural networks to automatically classify brain scans, relying on a small ROI (a few slices of the hippocampal region). Their evaluation reports accuracies of 91.86% for AD vs CN, 69.95% for AD vs MCI, and 68.52% for MCI vs CN. Table 3 summarizes the comparison of the different studies, whereas Table 4 compares several proposed methods in terms of the number of parameters. Both tables show the discriminative ability of the proposed method with a reduced number of parameters.

V. DISCUSSION
Effective and accurate diagnosis of Alzheimer's disease is critical for early intervention and treatment of the disease. Therefore, researchers have focused on developing computer-based systems to detect Alzheimer's disease at an early stage, and CNN-based image classification has been widely used in medical disease diagnosis. However, creating an efficient CNN model that also produces good results remains difficult. In this study, we investigated the classification of MRI images using CNN features with improved accuracy and fewer parameters. Previous models focused on increasing the depth and complexity of the network to improve classification performance. In this study, however, a new approach to reducing the number of parameters and the computational complexity of a CNN is presented. Earlier models with increased depth suffered from the vanishing gradient problem. We attempted to mitigate the vanishing gradient problem, promote feature reuse, and greatly reduce the number of parameters by proposing a modified convolutional network. The network consists of three different layer types. The first type is the input layer, which takes the $N$ gray-level image patches $X_n$, $n \in [1, N]$; the patches are resized to 96 × 96, their pixel values are normalized to the interval [0, 1], and they are fed into the network. The second type is the convolutional layer; Table 5 shows the proposed architecture used in this work. The proposed CNN model is built from four standard convolutional layers and three convolutional blocks, each of which consists of a depth-wise convolution with a skip connection (i.e., a grouped convolution with the number of groups equal to the number of channels) followed by a point-wise convolution (i.e., kernel size 1 × 1).
Each convolution in the block is followed by a GELU activation function, batch normalization, and a dropout layer. The model also uses a residual convolutional layer inspired by the skip connection model. The size of the convolution filters was set to 5 × 5, and the number of filters was set to 256. These settings were maintained throughout the model to ensure that the same weights are shared across the different sets of pixels in an image. The fully connected layer was the third type of layer and consisted of a set of input and output neurons that produced a learned linear combination of all neurons from the previous layer after passing through a non-linearity. The inputs and outputs of the fully connected layer were no longer spatially arranged but formed a 1D vector.
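To make the parameter saving of the depth-wise plus point-wise design concrete, the following minimal sketch counts the learnable parameters of one such pair against a single standard convolution. It uses the 5 × 5 filter size and 256 filters stated above; the helper name `conv2d_params`, the use of 2D convolutions with bias terms, and equal input and output channel counts are assumptions made for illustration, and batch normalization parameters are ignored. It is not the authors' implementation.

```python
def conv2d_params(c_in, c_out, kernel, groups=1, bias=True):
    """Learnable parameters of a 2D convolution (hypothetical helper).

    Each output channel sees c_in / groups input channels through a
    kernel x kernel window, plus one optional bias term.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)


CHANNELS = 256  # number of filters stated in the text
KERNEL = 5      # 5 x 5 filters stated in the text

# One standard convolution mixing space and channels at once.
standard = conv2d_params(CHANNELS, CHANNELS, KERNEL)

# Depth-wise: groups equal to the number of channels (spatial mixing only).
depthwise = conv2d_params(CHANNELS, CHANNELS, KERNEL, groups=CHANNELS)

# Point-wise: 1 x 1 kernel (channel mixing only).
pointwise = conv2d_params(CHANNELS, CHANNELS, 1)

separable = depthwise + pointwise

print(f"standard 5x5 convolution : {standard:,} parameters")
print(f"depth-wise + point-wise  : {separable:,} parameters")
print(f"reduction factor         : {standard / separable:.1f}x")
```

Under these assumptions the separable pair needs roughly 22 times fewer parameters than the standard convolution, which is the kind of saving the block design aims for.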
We used three CNN architectures in this paper: VGG, ResNet, and our proposed model. VGG networks are known for their uniformity, which makes them relatively easy to customise, verify, and use for a wide range of tasks; however, this property also makes them costly in terms of the number of parameters and hardware requirements. These drawbacks have been addressed in ResNet architectures, which also use very specific building blocks and allow the extraction of more complex patterns from data, increasing the number of layers while decreasing the number of parameters. Our architecture tries to address the same drawbacks through the concept of feature mixing described in [60], [61]. We chose depth-wise convolution to mix spatial locations, point-wise convolution to mix channel locations, and the skip convolution layer to mix global features from the data while reducing model complexity. From a practical point of view, hardware performance must also be regarded as an important consideration. Accordingly, the number of floating point operations (FLOPs) is considered an important predictor of energy usage and latency, along with the number of parameters [62]. Generally speaking, a network with fewer parameters and FLOPs requires less storage for the model and less hardware memory, and is thus better suited to embedded devices. To this end, a comparison of the accuracy and computational costs of the benchmark models and the proposed model is depicted in Figure 8. The advantage of the proposed model is that the ...
Furthermore, distinguishing individual groups, particularly subjects with MCI who are at high risk of developing AD, is critical for clinical control and management of the disease. Using the proposed method, we obtained an accuracy of 97.00% for AD vs. CN, 96.29% for MCI vs. CN, and 88.00% for AD vs. MCI, indicating an improvement in performance over some popular network models and the potential of the proposed model to detect subjects with prodromal dementia.
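The link between layer design, FLOPs, and memory discussed in this section can be illustrated with a rough estimate for a single layer. The sketch below counts multiply-accumulate operations (MACs; one MAC is commonly counted as two FLOPs) and the float32 weight storage for a standard 5 × 5 convolution versus a depth-wise plus point-wise pair with 256 channels. The helper names, the 96 × 96 output feature map (that is, stride 1 with padding that preserves the input size), and the 4 bytes per weight are assumptions for illustration only; they are not taken from Table 5 or Figure 8.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Multiply-accumulate count for one forward pass of a 2D convolution."""
    return h_out * w_out * (c_in // groups) * c_out * kernel * kernel


def weights_mib(num_weights, bytes_per_weight=4):
    """Storage for the weights in MiB, assuming float32 by default."""
    return num_weights * bytes_per_weight / (1024 ** 2)


H = W = 96      # assumed feature-map size (input size stated in the text)
CHANNELS = 256  # number of filters stated in the text
KERNEL = 5      # 5 x 5 filters stated in the text

standard_macs = conv2d_macs(H, W, CHANNELS, CHANNELS, KERNEL)
separable_macs = (
    conv2d_macs(H, W, CHANNELS, CHANNELS, KERNEL, groups=CHANNELS)  # depth-wise
    + conv2d_macs(H, W, CHANNELS, CHANNELS, 1)                      # point-wise
)

# Weight counts without bias terms, used only for the storage estimate.
standard_weights = CHANNELS * CHANNELS * KERNEL * KERNEL
separable_weights = CHANNELS * KERNEL * KERNEL + CHANNELS * CHANNELS

print(f"standard : {standard_macs:,} MACs, "
      f"{weights_mib(standard_weights):.2f} MiB of weights")
print(f"separable: {separable_macs:,} MACs, "
      f"{weights_mib(separable_weights):.2f} MiB of weights")
print(f"MAC reduction factor: {standard_macs / separable_macs:.1f}x")
```

Because the MAC count scales with the same kernel and channel terms as the parameter count, reducing one reduces the other, which is why both are reported together when judging suitability for embedded hardware.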

VI. CONCLUSIONS
We optimized a CNN for 3D whole brain images using ADNI data and obtained the best accuracy with an isotropically repeated convolutional block network architecture. The proposed method outperformed contemporary state-of-the-art systems. Moreover, our approach is fully automatic (i.e., no additional information input or manual intervention is required) and extremely fast. The proposed method can be used to find meaningful patterns in data, confirm previous findings by specialists, aid in diagnostic scenarios, and eventually identify patterns for diseases other than Alzheimer's. Future research could look at obtaining similar or better results for images that have already been pre-processed for skull alignment and subtraction. Finally, it would be interesting to include patient history data to enrich the information available in the MRIs, drive the decision-making process, and link it to the patient's background.