3D CNN Design for the Classification of Alzheimer’s Disease Using Brain MRI and PET

Diagnosing Alzheimer's disease (AD) from imaging modalities is one of the active applications of deep learning. Building on the theoretical background of past studies, we investigate how convolutional neural network (CNN) behavior changes when moving from 2D to 3D architectures. This study explores the output of a variety of CNN models applied to MRI and/or PET classification tasks for AD prediction and summarizes their characteristics as different parameters are tuned. Many architectures are available; here we test a basic architecture while varying the reception area, determined by the kernel size and stride of the convolutional layers. An architecture is categorized as converging, diverging, or equivalent according to whether the filter kernel size decreases, increases, or remains unchanged across layers. The investigated model is a simple encoder-based CNN with a sequential flow of features from low-level to high-level extraction. The idea is to provide a diverging reception area by increasing the filter size and stride from the lower to the higher layers; as a result, feature redundancy is reduced and trivial features keep diminishing. The proposed architecture is referred to as 'divNet', and several experiments were performed to determine its effectiveness in terms of memory consumption, number of parameters, running time, classification error, and generalization error. The study also surveys related experiments that vary the hyper-parameter settings, the architecture selection based on depth and reception area, and the dataset size.


I. INTRODUCTION
According to the Alzheimer's Association Report (AAR) [1], the molecular and neurological changes of Alzheimer's disease (AD) take place in the neurons, specifically at the synapse, the connection area between brain nerve cells where neurotransmitters are released. The synapse supports information flow through tiny bursts of chemicals released by one neuron and detected by a receiving neuron. In AD, β-amyloid proteins and tau proteins (the latter forming so-called tau tangles) accumulate around the synaptic region. β-amyloid is suspected to cause neuron death by interfering with neuron-to-neuron communication at the synapse, while the tau tangles block the supply of nutrients and other essential molecules inside the neurons. Brains with advanced AD show dramatic shrinkage due to cell loss, inflammation, and widespread debris from dead and dying neurons, causing memory loss problems (e.g. dementia) that worsen with age. This is the molecular and physiological picture of AD. There are, however, also physical changes in the anatomical brain structures commonly associated with AD, such as enlargement of the ventricles, shrinkage of the hippocampus, changes in cortical thickness, and alterations in other cerebral areas containing white matter and gray matter brain tissue as well as cerebrospinal fluid. Clinicians visualize these changes and atrophies through brain imaging with a variety of medical imaging modalities such as magnetic resonance imaging (MRI), positron emission tomography (PET), and computed tomography (CT). Here lies the true use of image processing and machine learning: image processing improves image quality for better visualization of the brain, whereas machine learning assists clinicians with logical operations such as segmentation, classification, and quantification, which can be time-consuming and sometimes baffling. Once a logical operation is modeled under proper supervision, the designed algorithm can produce predictions; the more often the predictions are correct, the better the model and the higher its reliability. Mild cognitive impairment (MCI) is a transitional stage between normal aging and the preclinical phase of dementia. MCI is considered a possible early stage of AD; it can either progress into AD (pMCI) or remain stable throughout life (sMCI). Here, we combine both types into a single MCI group to ease the classification process. A healthy MRI is labeled normal aging/cognitively normal (CN). Although genetic factors influence AD, no single causative trigger has been identified; influencing factors include genetics, low education or professional involvement, lack of mental exercise, family history, and external or internal brain injuries [2].
Image processing aims to find discriminative patterns of image features by grouping MRIs of the same class together. The pattern eventually discovered for AD patients should behave consistently for the recognition of other AD patients while remaining distinguishable from CN and MCI-affected MRIs. Once the magnetic resonance signal is reconstructed into an image, each structure is represented by pixel values, and these pixels are assigned to a class. Ultimately, AD classification is based on the features extracted from these brain image pixels. The main features required to accurately capture the major AD-related variations of anatomical brain structure include the size of the ventricles, the hippocampus shape, the cortical thickness, and the brain volume [3]. Such alterations, however, may resemble other brain-related diseases like Parkinson's disease (PD) and encephalitis [4]; in such cases, further clinical and physiological tests should be performed at the genetic level. Consequently, identifying pathogenic scans among healthy ones is easier than identifying a particular disease within a pool of pathogenic scans. Imaging alone may therefore not constitute valid proof for diagnosing a person with AD. Nevertheless, based on the brain phenotype reflected in the imaging, the discriminative features learned by a trained network can help identify AD-prone images.
This study presents results that answer a few questions related to the use of deep learning for medical imaging. It starts with the background of CNNs and recent literature on their application to medical image classification. The related questions on the role of the CNN architecture, hyper-parameters, depth, and data size are discussed in section II. The mathematical formulation and the pseudo-codes used are presented in section III. In the succeeding sections, we discuss the performance of different architectures, the role of the hyper-parameters, the selection of data, and the effect of dataset size on the design of an optimal network that can be implemented in practice. We then survey shallow to deep layers with different feature sampling regions and finally arrive at a diverging architecture that supports both MRI and PET. The proposed architecture, referred to as 'divNet', and its sibling architectures have been thoroughly investigated, and the results are presented in sections VII and VIII. All the results of these experiments are meticulously presented, discussed, and analyzed; we therefore consider this a survey-based research paper.

II. THE BACKGROUND STORY
A. 3D CNN
Inspired by the neural architecture of the mammalian cerebrum, an artificial neural network (ANN) tries to mimic the information flow and decision-making pattern of the brain. Hubel and Wiesel [5] recorded the activity of single brain cells in cats and found that some cortical cells respond to contours of a specific orientation. Moreover, patterns of light stimuli that are most effective in influencing units at one level may no longer be the most effective at the next. Although millions of neurons and synapses receive the stimuli, only certain neurons are trained to respond to specific features or aspects of an image [5]. Just as in the brain, where a stimulus generates neuron spikes only in a specific area, an ANN has only a few activated nodes for each shape, which may be a horizontal, vertical, or diagonal line. The node activated for each line is different and unique; the node activated by a horizontal line in one image is also activated by a horizontal line in another image, and so on. This is the basic principle of an ANN, and the layer-wise connections between its nodes mirror the dense connections between neurons.
A CNN is similar to an ANN, except that its weights are convolutional filter elements rather than single-node multipliers, and it contains additional feature operators in the form of pooling and activation functions. Newly developed training algorithms have made CNNs increasingly effective, ultimately surpassing human-level accuracy for natural image classification [6], [7]. Among the wide variety of CNN topologies, prominent ones include residual (ResNet50, ResNet101 [20]), recurrent (RCNN [24]), inception (GoogLeNet [21]), and encoder-decoder (U-net [38]) networks. The common element in all of these topologies is the encoder unit, i.e. convolution-normalization-activation-pooling, which acts as the fundamental unit for feature generation. Therefore, we build our networks as combinations of these encoding layers.
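As a concrete reference, the following is a minimal sketch, in MATLAB (the environment used later in this study, Deep Learning Toolbox R2019a or later), of one such encoder unit. The filter count of 64 matches the text; the kernel and pool sizes are illustrative assumptions.

% A minimal sketch of one encoder unit:
% convolution-normalization-activation-pooling.
encoderBlock = [
    convolution3dLayer(3, 64, 'Padding', 'same')  % 3x3x3 kernels, 64 filters
    batchNormalizationLayer                       % normalize the activations
    reluLayer                                     % non-linear activation
    maxPooling3dLayer(2, 'Stride', 2)             % down-sample each dimension by 2
    ];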
The existing ideas in 3D CNNs are mainly 'best patch' or 'multiple patches trained for a CNN ensemble' architectures [8]. In the 'best patch' approach, a single region of the brain is selected from a recommended region of interest (ROI) or manually chosen from an anatomic region of atrophy, such as the hippocampus or amygdala. In the ensemble approach, multiple CNNs are trained separately on multiple ROIs, and their features are concatenated at the last fully connected layer (FCL) before classification. One reason for feeding only limited, selected, informative voxels into a 3D CNN may be GPU memory constraints, along with the desire to increase information quality. Non-discriminative parts, although they contribute to low-level feature construction, may not support cohort classification, so a whole-brain model risks redundant information. Moreover, selecting an ROI patch, or simply the best region, makes the system semi-automatic; the truest sense of automatic feature extraction is not achieved in these cases. This research aims to make the classification simpler and more candid rather than a multifaceted process. That is why we want to build an automatic and discriminative CNN that can work with MRI, PET, and any other 'pixelary' (pixel-based) object/entity irrespective of its input size.
The work of Huang et al. [3] mainly focused on the hippocampal region; they proposed a multimodal 3D CNN that uses a hippocampal ROI from MRI and hippocampal and/or cortical ROIs from PET, without segmentation as a prerequisite task. They separately trained VGG-inspired CNNs for the MRI-based and PET-based ROIs and concatenated the final FCLs before classification. In a similar multimodal 3D CNN attempt, Liu et al. [9] also proposed a simple CNN model like that of Yechon et al., but instead of concatenating the final FCLs, the concatenation was done at the convolutional layer: the outputs of each CNN (trained on PET and MRI patches) were convolved sequentially until the features were flattened at the FCL. They experimented with a T1-MRI and FDG-PET based cascaded CNN, which utilizes a 3D CNN to extract features and adopts another 2D CNN to combine multi-modality features for task-specific classification. In 2016, Hosseini-Asl et al. [14] proposed a deeply supervised and adaptable 3D CNN (DSA-3D-CNN), trained on structural MRI (sMRI) images, for the AD vs. MCI vs. CN task. Similarly, Payan and Montana [34] used a sparse autoencoder (SAE), patch-based 3D CNN to classify MRI scans using dataset partitioning, unlike Oh et al. [35], who performed 5-fold cross-validation (CV) using a convolutional autoencoder (CAE) based volumetric 3D CNN for AD vs. NC and supervised transfer learning for sMCI vs. pMCI classification.

B. WHY MOVE FROM 2D TO 3D?
This study aims to explore one more dimension for the CNN, i.e. the depth. The key question to explore is: can we depend on 2D CNN results alone?
As mentioned, a 2D CNN can easily be misled [37], in the sense that a CNN trained on a target domain can only output a probability score for each trained class; moreover, changing a few pixels can ruin the prediction [37]. Some researchers have suggested a possible performance improvement over 2D images if the 3D whole-brain structure is used to train the CNN [10], owing to its deeper architecture. But a deeper architecture means more parameters (layer weights) to train, which in turn requires bigger and better training material. A CNN, whether 2D or 3D, follows a generic feature extraction pattern [11], [12]. Here, generic features refer to CNN features, also called 'off-the-shelf CNN features' [13], which are the image features extracted by the multiple convolutional layers as the weights (decimal numbers) of the trained network under various activation functions. Typically, the final feature weights from the FCL are plotted to judge the performance of the CNN; a well-separated, class-wise segmented graph generally indicates a well-trained classifier [14].
The problems with a 2D CNN are the selection of the appropriate slice or slices and their orientation as training inputs. Much of the literature suggests using the 'best scan' [15], [16] or the 'best multiple slices' [14], [17] for efficient performance, which rather mystifies the slice selection process; this is problematic and often impracticable. Important information might be missed if we focus only on limited scans or orientations. The best and safest option is to use the whole brain volume, which comes as three-dimensional pixel values, i.e. values along the x, y, and z dimensions in planar geometry. In our previous work [18], we demonstrated that a 2D CNN trained on fewer MRI images yields poor classification performance, and the selection of slices remains ambiguous. Furthermore, the dimensional constraints of a 2D CNN force the architecture to become deeper and bulkier to accommodate thousands of images per class. Hence, to make MRI classification universal and less tedious, a 3D MRI fits readily into a 3D CNN. Transfer learning also seems an inappropriate choice, as the popular available models like AlexNet [19], ResNet [20], GoogLeNet [21], and ZFNet [22] are all 2D architectures. Using 3D input requires fewer pre-processing steps such as slice correction, slice selection, and slice extraction; as a result, manual processing is reduced and the system becomes more robust and automatic, which is the goal of this study. Regarding image pre-processing, we performed only image resizing and normalization before feeding the data into the CNN, because we want the system to be less customized and to work regardless of imaging protocols and scanners; the selections are discussed in section VII B. Moreover, elaborate pre-processing steps require more effort and operation time, keeping in mind that the data are already processed by the Alzheimer's Disease Neuroimaging Initiative (ADNI): we are not provided with the raw scanner images, but with semi-corrected, processed MRIs. This can eventually be useful for the generalization of the trained model.

C. FINDING THE CORRECT ARCHITECTURE AND HYPER-PARAMETERS
Although a CNN can be easily misled, it is quite smart. Irrespective of the depth (deep or shallow), the training material (good or bad), or the training size (small or big), a CNN always learns something when trained. This 'something' may not correspond to human-interpretable logical features (say, the number of legs of a dog compared with a human), but the network will categorically learn some details that allow classification, most often basic shapes, edges, corners, and patterns on the objects. So we do not need to worry about selecting an architecture every time; the challenge arises when we seek the best architecture, ease of training, and good performance. This trio poses the ultimate contest for any deep learning researcher. Performance results, training time, validation period, prediction confidence, generalizability, and other factors determine the state-of-the-art winner. The results of our experiments are highlighted in Tables 1, 2, and 3.

D. HOW DEEP SHOULD WE GO?
Recent studies have suggested that a CNN can extract useful features directly from a raw image, unlike manually supervised learning algorithms, and that it has a strong capability to locate key points and features in object detection tasks for natural images [23], [24]. This property has been exploited in the region-based convolutional neural network (R-CNN) for region-based detection in 2D images. Other work on segmentation with CNNs suggests that the segmentation result itself does not contain the information needed for classification, so segmentation is not a prerequisite for the classification task; the CNN can learn useful features without the voxels being labeled [3]. All of these experiments support the generic feature extraction property of the CNN. But how deep should we go? The obvious motivation for going deeper is to extract more meaningful features for the classification or segmentation task at hand. In general, more layers produce more feature vectors, and subsequently a larger pool of features to draw from, which helps in 'judging' the best among the good features. Nevertheless, 'we should go deeper' [25] does not hold for every deep learning model, nor every time, and the results are not always supportive. The work of He et al. [20] on ResNet shows that a network with 1,202 layers offers no significant improvement over 50, 101, or 152 convolutional layers despite its aggressive depth. Beyond the cost of extra training, more depth may make a network more prone to overfitting by learning 'too well', failing to generalize while consuming expensive GPU resources and making the model harder to build and fully understand [26].

E. HOW MUCH DATA IS NEEDED?
The breakthrough of the ImageNet dataset and its use in AlexNet suggests that better data yields better results. To support this theory, artificial datasets have been created with different augmentation techniques, and indeed the use of extensive synthetic MRI has improved performance in segmentation and classification tasks [27], [28]. ImageNet poses a classification over 1,000 classes with around 8,000 images per class, i.e. many classes with many distinctive images; the case is similar for other datasets like CIFAR-10 and Caltech-101, where data acts like oil for AI [29]. That said, what is the situation with medical images? Considering labels as the most precious asset for the data scientist, how voluminous should the training material be? For medical images the task is more challenging: with image-based features we can rarely detect the atrophy pattern, particularly when comparing AD vs. MCI or MCI vs. NC MRIs or PETs [FIGURE 1]. Hence, we experiment with various dataset sizes, one big and one small, for both the MRI and PET tests. The results are highlighted in Table 4, and detailed demographics for each dataset type are tabulated in the Appendix.

F. VISUALIZING FEATURES: WHAT HAS THE CNN EXTRACTED AND LEARNED?
A generalized CNN reduces features progressively from the input to the final classification layer. The same happens here: the input to the CNN is a 3D MR image in NIfTI (Neuroimaging Informatics Technology Initiative) format with the .nii extension. Once the input is read using the niftiread function built into MATLAB, it is resized from its original size of 256 × 256 × 256 or 256 × 256 × 170 to 64 × 64 × 64. After multiple down-samplings using the max-pool operation in each convolutional layer, the representation is reduced to 1,728 features at the first FCL. To reduce overfitting, we use dropout, and the succeeding FCLs reduce the final output to 100 features per class, which is the input to the softmax layer. This idea of using multiple FCLs to map the target domain is often called target-domain fine-tuning, which is the basis of transfer learning with pre-trained networks. The activated features in the initial convolutional layer detect pixel changes based on attributes like lines, edges, and color [30] within a small filter window. These edge-based features pass through the intermediate layers of the CNN, where they are combined across a large number of filters whose weights (initialized randomly or using Xavier, He, or Gaussian initialization) are updated by backpropagation training along a specific optimization path such as stochastic gradient descent (SGD) or Adam. The intermediate layers detect the activated parts of the image, whereas the final layers learn discriminative features of shape and pattern among the target domains. Once training reaches convergence, meaning no further weight changes occur and the training accuracy reaches its maximum, the training stops. The trained network is now a generic feature extractor, like a traditional algorithm that generates features; the generated features are the discriminative features used to distinguish the classes. This study uses multiple 3D filters that give a 4D output in each layer, i.e. one 3D feature map per filter, see [FIGURE 2]. Convolving the image with these filters produces feature maps that signal the presence of those features in the image. This behavior is the essence of a CNN's automatic feature extraction, and it supports automatic computer-aided diagnosis (CAD) systems.
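A minimal sketch of this input pipeline follows: read a NIfTI volume with MATLAB's niftiread, resize it to the 64 × 64 × 64 network input, and normalize it. The file name is a hypothetical placeholder; imresize3 requires the Image Processing Toolbox.

% Read, resize, and min-max normalize one 3D scan.
vol = niftiread('subject_001.nii');                       % e.g. 256x256x256 voxels
vol = imresize3(double(vol), [64 64 64]);                 % down-sample to the input size
vol = (vol - min(vol(:))) ./ (max(vol(:)) - min(vol(:))); % normalize to [0, 1]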
It is difficult to predict what features a CNN will learn without training it, which makes analyzing the features a tedious task. A single network may contain millions of parameters, and we cannot mathematically predict the final converged value of each filter without training. Hence, every time we train the CNN, the learned features need to be investigated. Once trained, the CNN is loaded with the filter weights, which are used to make predictions on the test images; each layer convolves the input, producing different responses for different MRIs. The trained network is used to obtain the features as described in Pseudo-codes 1, 2, and 3.
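In the spirit of those pseudo-codes, a minimal sketch of such an inspection is given below, assuming the MATLAB release in use supports the activations function on trained 3-D image networks. The variables trainedNet and testVol and the layer name 'conv_4' are assumed placeholders.

% Extract the activations of one convolutional layer for a test volume.
feat = activations(trainedNet, testVol, 'conv_4'); % 4-D output: one 3-D map per filter
disp(size(feat));                                  % e.g. [8 8 8 64] for 64 filters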

III. PARAMETER INITIALIZATION
Let us assume that the MRI/PET is a 64 × 64 × 64 matrix represented by $I$, i.e. $I = \{I_{x_i, y_i, z_i}\}$, $i = 1$ to $64$. In total, this gives 262,144 gray-scale values, the numerical representation of the 3D image. Since we are working in 3D, we call each of these values a voxel, not a pixel. Each voxel has x, y, and z coordinates; here, we are simply representing the MRI as a cube.
Hence, each voxel value mathematically carries three coordinates, but for ease of representation we use the single vector notation $v$, where $v = I_{x_i, y_i, z_i}$, to keep the computation simple. Consider the first convolution in the first layer, as in Equation (1):

$$x_N^1 = b_1^1 + \sum \left( w_{N,1}^1 \otimes v \right) \tag{1}$$

Here, $b_1^1$ and $w_{N,1}^1$ represent the initial bias and the weight of the first convolution kernel in the $N$th filter, set by an initialization algorithm, and the sum runs over the kernel window. Note that $\otimes$ represents element-wise multiplication.
The window of the convolution operation then keeps moving according to the stride size. To shorten the mathematical expression, this can be rewritten for each node of the 3D convolution filter as:

$$x_k^l = b_k^l + \sum_i \operatorname{conv3}\!\left( w_{ik}^{l-1}, s_i^{l-1} \right) \tag{2}$$

where conv3 is a regular 3-D convolution without zero padding on the boundaries. In Equation (2), $x_k^l$ is the input and $b_k^l$ the bias of the $k$th neuron at layer $l$, $s_i^{l-1}$ is the output of the $i$th neuron at layer $l-1$, and $w_{ik}^{l-1}$ is the kernel (weight) from the $i$th neuron at layer $l-1$ to the $k$th neuron at layer $l$. Here conv3 represents an element-wise multiplication over the [3 × 3 × 3] kernel window. For the very first convolutional layer, the input $s_i^{l-1}$ is the 3 × 3 × 3 patch of (possibly normalized) image voxel values scanned by a window of the same size.
When represented in discrete matrix form, the N-dimensional convolution of the discrete, N-dimensional variables $A$ and $B$ can be defined by (3):

$$C(j_1, j_2, \ldots, j_N) = \sum_{k_1} \sum_{k_2} \cdots \sum_{k_N} A(k_1, k_2, \ldots, k_N)\, B(j_1 - k_1, j_2 - k_2, \ldots, j_N - k_N) \tag{3}$$

Each $k_i$ runs over all of the values that lead to legal subscripts for $A$ and $B$. Thus, the 3D convolution runs as follows.
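A small numerical check of Equation (3) can be made with MATLAB's convn: with the 'valid' shape option (no zero padding, as in conv3 above), a 3 × 3 × 3 kernel sliding over a 5 × 5 × 5 block leaves a 3 × 3 × 3 output. The values below are random placeholders.

% Valid 3-D convolution: output size is (5-3+1) per dimension.
A = rand(5, 5, 5);           % input patch
B = rand(3, 3, 3);           % 3x3x3 kernel
C = convn(A, B, 'valid');    % keep only fully overlapping positions
disp(size(C));               % prints [3 3 3]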
The layer convolves the input by moving the filters along the input dimensions (vertically, horizontally, and along the depth). At each position it computes the dot product of the weights and the input, and then adds a bias term. As the filter moves along the input, it uses the same set of weights and the same bias for the convolution, thus forming a feature map.
In the SGD algorithm, the filter weights are updated iteratively during optimization as shown in Equations (4) and (5), where $W_t^l$ denotes the weights of the $l$th convolutional layer at the $t$th iteration and $E$ denotes the cost function (minimized through backpropagation) over a mini-batch of size $N$:

$$\frac{\partial E}{\partial W_t^l} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial E_n}{\partial W_t^l} \tag{4}$$

$$W_{t+1}^l = W_t^l + m\, \Delta W_t^l - \gamma\, \alpha_l\, \frac{\partial E}{\partial W_t^l} \tag{5}$$
Here, $\alpha_l$ in Equation (5) is the learning rate of the $l$th layer, $m$ is the momentum applied to the previous weight update $\Delta W_t^l = W_t^l - W_{t-1}^l$ in the current iteration, and $\gamma$ is the scheduling rate that decreases the learning rate upon the completion of each epoch. Setting $\alpha_l = 0$ depends on the value of $l$: all layers $1{:}l$ then have their weights frozen rather than updated, and those weights are carried over unchanged into the final version of the trained model.
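A minimal sketch of how the quantities in Equations (4) and (5) map onto MATLAB's SGDM training options follows: InitialLearnRate plays the role of α, Momentum of m, and LearnRateDropFactor applied every epoch of γ. The numeric values are illustrative assumptions, not the tuned settings reported in Table 3.

% SGD with momentum and a per-epoch learning rate schedule.
opts = trainingOptions('sgdm', ...
    'InitialLearnRate',    0.01, ...        % alpha
    'Momentum',            0.9, ...         % m
    'LearnRateSchedule',   'piecewise', ...
    'LearnRateDropFactor', 0.95, ...        % gamma
    'LearnRateDropPeriod', 1, ...           % decrease after each epoch
    'MaxEpochs',           100, ...
    'MiniBatchSize',       8);              % mini-batch size N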

A. PARAMETER TRAINING
The error in Equation (6) is a mean squared error, obtained by summing the squared deviation of each sample's prediction $y_i^L$ from its training target $t_i$; the superscript $L$ denotes the output of the final layer:

$$E = \sum_i \left( y_i^L - t_i \right)^2 \tag{6}$$

Based on the obtained error $E$, backpropagation (BP) is performed to update the weight of each parameter. Writing $\Delta_k^{l+1} = \partial E / \partial x_k^{l+1}$ for the delta of filter $k$ at layer $l+1$, the weight gradient is as in Equation (7):

$$\frac{\partial E}{\partial w_{ik}^{l}} = \operatorname{conv3}\!\left( s_i^{l}, \Delta_k^{l+1} \right) \tag{7}$$

Similarly, the bias is updated as in Equation (8):

$$\frac{\partial E}{\partial b_k^{l+1}} = \sum \Delta_k^{l+1} \tag{8}$$

Writing this for the whole chain of layers $1$ to $l+1$, the deltas from the $N$ filters of layer $l+1$ are summed to obtain the output $y_i^l$ of the $l$th layer during BP, as in Equation (9):

$$y_i^{l} = \sum_{k=1}^{N} \operatorname{conv3}\!\left( \Delta_k^{l+1}, w_{ik}^{l} \right) \tag{9}$$

During training, we also need to backpropagate the gradient of the error $\partial E$ through the batch normalization (BN) transform and compute the gradients with respect to its parameters.
All experiments were conducted using MATLAB R2019a academic software on Windows 10. Network models were trained on an NVIDIA GeForce RTX 2070 GPU with 24 GB of memory and tested on an Intel Core i5-9600K CPU @ 3.70 GHz with 32 GB of memory. The trained .mat file will be provided to researchers upon request to the authors.

IV. TEST ON DIFFERENT CNNs
To determine an optimal number of layers for our 64 × 64 × 64 3D input, we started with a single encoder layer, i.e. convolution-batch normalization-ReLU-max pooling, denoted L1. Encoder blocks were then stacked to form the L2, L3, L4, L5, and L6 networks consecutively. In L6, the final feature size after the sixth convolution was [2 2 2] for each of the 64 filters; since each feature map is then only two pixels along each axis, extending to an L7 layer would be inoperable and would only shrink the features further, so no seven-convolution architecture was used. Table 1 shows the classification results of these layer-wise CNNs, whereas Table 2 presents the classification results of four architectures that differ in reception area, i.e. the window size of the convolution kernel. The training and validation graphs were also studied to observe how the architectures affect training and to better understand the convergence process of each CNN (Figure 3). Correspondingly, to understand the extracted features, a single MRI from each target domain was passed through each convolutional layer and the features were observed as in Figure 4; on close inspection, differences in lines, edges, intensities, and other patterns can be seen across the class domains. Moreover, the FCL layers were visualized using t-SNE projection for each architecture (Figure 5) to support our findings; since the features are visualized for the whole test set, this helps judge which architecture segregates the features best. Finally, the results for different hyper-parameter settings and datasets are tabulated in Table 3 and Table 4, respectively.
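The following is a minimal sketch, in MATLAB, of how such layer-wise networks can be assembled by stacking encoder blocks. The 100-feature FCL and the three-class output follow the description in section II F; the dropout rate and padding are assumptions.

% Stack k encoder blocks (k = 1..6 for L1..L6) and close with the FCL head.
k = 4;                                           % depth under test, e.g. the L4 network
layers = image3dInputLayer([64 64 64 1]);
for i = 1:k
    layers = [layers
        convolution3dLayer(3, 64, 'Padding', 'same')
        batchNormalizationLayer
        reluLayer
        maxPooling3dLayer(2, 'Stride', 2)];
end
layers = [layers
    fullyConnectedLayer(100)                     % reduce to 100 features
    dropoutLayer(0.5)                            % curb overfitting
    fullyConnectedLayer(3)                       % AD vs. MCI vs. CN
    softmaxLayer
    classificationLayer];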

V. WHY DIVERGING ARCHITECTURE?
The filter size determines the scanning window of the convolution, and the size of this window can be viewed as the reception area. We increase the filter size by two in each consecutive layer so that features are extracted sequentially at a low, intermediate, and high level, with a larger reception area in each successive layer. The low-level features are extracted with a 3 × 3 × 3 filter window and max-pooled with a 2 × 2 × 2 window of stride one in the first convolutional layer (i.e. conv_1 to max_1) [FIGURE 2 (b)]. We call this a diverging network in the sense that the filter kernel size keeps increasing, along with the step size or stride, while the number of filters in each layer stays the same (i.e. 64) to maintain the channel size for the 64 × 64 × 64 input. The first convolutional layer, with its 3 × 3 × 3 filters, can easily capture minute details; as the layers deepen, features are accumulated by increasing the window size in each layer, and the max-pool stride is increased correspondingly to reduce feature redundancy. Conversely, in the converging network the reception area keeps decreasing from an initial filter size of 9 × 9 × 9, whereas the equivalent architecture uses a uniform kernel size of 3 × 3 × 3 in every convolutional layer. All architectural details and the results after training and testing are highlighted in Table 2, with the parameters listed in the second column.
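As an illustration, a minimal MATLAB sketch of the diverging scheme follows. The kernel progression 3, 5, 7, 9 and the pooling strides are assumptions consistent with the description above (grow by two per layer, first pool stride of one), not a verbatim copy of Table 2.

% Diverging reception area: kernels grow with depth, filter count stays at 64.
kernel = [3 5 7 9];                      % diverging window sizes
stride = [1 2 2 2];                      % max-pool strides; conv_1 -> max_1 uses 1
divLayers = image3dInputLayer([64 64 64 1]);
for i = 1:numel(kernel)
    divLayers = [divLayers
        convolution3dLayer(kernel(i), 64, 'Padding', 'same')
        batchNormalizationLayer
        reluLayer
        maxPooling3dLayer(2, 'Stride', stride(i))];
end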

VI. PET OR MRI OR BOTH?
To find the effect of the size of the training material, we trained the L4 diverging network with a variety of datasets; the results are shown in Table 4. The MR and PET images were all obtained from patients at ADNI baseline (BL) visits under the ADNI 1 project [41]. We used 3D T1-weighted structural MR scans of the whole brain, normalized and processed with the ADNI pipeline, with a few additionally scaled (listed in the Appendix), whereas the PET scans, also from ADNI BL, were processed for smoothing and co-registration, with a few standardized (listed in the Appendix). Our experiments showed that MRI is a better imaging modality than PET for 3D CNN classification. When the network was trained with the smallest dataset, MRI1 (see Table 4, 5th column for the type), it under-fitted; the testing accuracy was low at 74.5%, slightly lower than the validation accuracy, although training converged with the accuracy reaching 100%. The same network trained with the BASELINE_MRI data (type MRI2, see Table 4) under the same environment achieved the highest testing accuracy of 94.5%. The increase may be due to the higher scans-per-patient ratio (SPR), which decreases the variability of each scan at the cost of generality in the network. The PET scans performed the worst in the L4 divNet, with increased training time: the BASELINE_PET_SMALL dataset (PET1) reached a testing accuracy of only 66.34%, whereas the bulkiest PET dataset (BASELINE_PET_ALL, PET2) reached only 50.21%, with difficulty converging within 100 epochs and a GPU training time almost three times that of PET1, despite being ten times larger. Finally, the MRI2+PET1 datasets were merged and trained in a single network; however, it reached only 90% training accuracy after convergence and a testing accuracy of up to 82%. It therefore appears that MRI is the better choice for the CNN, with PET playing only a complementary role in AD prediction. It is worth mentioning that the PET images are visually less discriminative across the target classes than the MRI images (see FIGURE 1), which may explain MRI's better performance.

VII. EXPERIMENTAL RESULT
We present all of the results of our experiments in the tables and figures below.

FIGURE 3. Training loss (TL) and validation loss (VL) (Y-axis) plotted against the iteration number (X-axis) over 100 epochs for the networks presented in Table 1. Remarks: (a) L1: VL is much less than TL, indicating a possible overfitting case. (b) L2: VL is less than TL, indicating a possible overfitting case. (c) L3: VL is higher than TL, indicating a possible under-fitting case. (d) L4: VL is slightly higher than TL, indicating a possible optimal case. (e) L5: VL is much higher than TL, indicating a possible under-fitting case. (f) L6: VL and TL both have higher values, indicating a possible under-fitting case.

A. TEST ON DIFFERENT LAYERED CNN
Table 1 highlights the results of the diverging architecture-based configurations with different numbers of layers, starting with two convolution encoding layers and going up to six. The parameter column details the filter size, number of filters, max-pool filter size, stride, and FCL input and output numbers, as indexed in each row. Training accuracy reached almost 100% for every configuration, whereas the validation and testing accuracy start dropping after the L4 layer. This could be the optimal case, as plotted in the training and validation loss against the epoch numbers in Figures 3(a) to 3(f), with remarks on the overfitting or under-fitting cases.

B. TEST ON DIFFERENT ARCHITECTURES
As discussed in section IV, the results of the different architectures based on the reception area of the convolving filters, i.e. the four architectures (diverging, equivalent, converging, and U-net), are presented in Table 2. The parameter column is indexed in the same way as in Table 1.

C. TEST FOR DIFFERENT HYPER-PARAMETER SETTINGS
As discussed in section II C, hyper-parameters play an important role in reaching the optimal case for the best network performance, so we experimented with several activation functions, initialization techniques, and optimization algorithms to find the best case, as shown in Table 3.

D. FIGURE FOR EACH ARCHITECTURE'S CONVOLUTIONAL TRANSFORMATION
The convolutional transformation is visualized using Pseudo-code 1; Figure 4 presents the analysis for each class domain, visualized using a single patient's MRI scan. The number of features keeps reducing from each convolutional layer to the next. The result from the L4 diverging architecture is presented in slice view, scaled to 64 × 64 for better visualization.
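A minimal sketch of such a slice view follows: rescale one 3-D feature map to 64 × 64 in-plane and tile its axial slices with montage (Image Processing Toolbox). Here featureMap is an assumed 3-D activation volume from one filter, e.g. feat(:,:,:,1) from the activations call shown earlier.

% Tile the axial slices of one feature map for visual inspection.
fm = imresize3(featureMap, [64 64 size(featureMap, 3)]); % upscale in-plane to 64x64
fm = rescale(fm);                                        % map intensities to [0, 1]
montage(reshape(fm, 64, 64, 1, []));                     % one tile per axial slice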

E. TEST ON DIFFERENT DATASETS
Although the network is finalized, the dataset size still needs to be determined, as it can heavily impact network performance. We were therefore interested in how the amount of training material affects the testing accuracy, and we performed experiments on the different datasets shown in Table 4. Demographic details and file types are listed in the Appendix.

F. FCL FEATURE VISUALIZATION FOR EACH ARCHITECTURE
The FCL feature visualization for each architecture type is shown in Figure 5, where we present the class-wise representation of the last three FCLs used.
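A minimal sketch of this kind of projection follows: gather an FCL's activations over the test set, project them to 2-D with t-SNE (Statistics and Machine Learning Toolbox), and color the points by class. The variables trainedNet, testVols, and testLabels and the layer name 'fc_1' are assumed placeholders.

% Project one FCL's test-set features to 2-D and color by class.
fcFeat = activations(trainedNet, testVols, 'fc_1', 'OutputAs', 'rows'); % one row per scan
Y = tsne(fcFeat);                                                       % N-by-2 embedding
gscatter(Y(:, 1), Y(:, 2), testLabels);                                 % color dots by class
xlabel('t-SNE dimension 1'); ylabel('t-SNE dimension 2');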

VIII. 3D CNN STATE OF THE ART COMPARISON
Hosseini-Asl et al. [14] used a deeply supervised adaptable 3D CNN (DSA-3D-CNN) based on an autoencoder network for AD classification and demonstrated the feature maps of its various layers. The reported accuracy is 97.06% for the binary AD/NC classification using only the MRI dataset. That accuracy comes from a 10-fold CV, which means only one MRI in every batch of ten is used for testing while the other nine are used for training and validation; hence, only 10% of the total images (i.e. 21 subjects) are used for testing [36]. Moreover, each image participates in both training and testing, so the idea of an untouched test set is effectively abandoned during cross-validation. Oh et al. [35] also performed a 5-fold CV on a moderately sized dataset, with an accuracy of around 84.5%. Goceri [32] and Gupta et al. [33] reported accuracies of 98.06% and 94.74%, respectively, using data splitting with 20% and 10% of the dataset for testing. Although the accuracies are high, the SPR is also high, which may cause generalization error. Payan and Montana [34] achieved optimal performance with a larger data size, with an accuracy of around 89.47% for the three classes AD/MCI/HC; however, their testing ratio is only 10%, suggesting possible overfitting, and they trained the 3D CNN on 5 × 5 × 5 patches rather than whole MRIs. Conversely, we tested using whole MRIs and PETs at different data sizes, splitting the data in a 5:2:3 ratio for training, validation, and testing. The 30% untouched data, when tested, therefore gives us a reliable result.
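A minimal sketch of such a 5:2:3 split follows: permute the scan indices once and cut them into train, validation, and test partitions. Here n is a placeholder dataset size.

% One-shot 5:2:3 split of scan indices.
n = 1000;
idx = randperm(n);
nTr = round(0.5 * n);  nVa = round(0.2 * n);
trainIdx = idx(1:nTr);                 % 50% for training
valIdx   = idx(nTr+1 : nTr+nVa);       % 20% for validation
testIdx  = idx(nTr+nVa+1 : end);       % 30% untouched test set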
In Table 5, the term SPR is introduced, which indicates the use of multiple scans from a single patient, not necessarily acquired at the same time. For SPR greater than '1', multiple MRIs or PETs were acquired from a single patient, with the image acquisition and pre-processing steps differing between scans. A lower SPR value brings variability to the dataset; a value of '1' indicates a single scan per patient. This can eventually bring generality to the trained model, but it can also lower performance owing to the constraint of limited training material, as in our case with the MRI scans, where the accuracy dropped to 74.55% (ϕ) from our best outcome of 94.5% (ξ, see Table 5). To check this with PET, we first trained on a smaller database with one scan per unique patient (i.e. SPR = 1); the results were poor. We then tested a larger PET database with a higher SPR, which also performed poorly (ζ), leading us to conclude that PET is not a good choice for image-based 3D CNN classification. Further tests with PET+MRI, presented in the last row of Table 5, produced a moderate result that is merely due to more true positives from the MRI scans than from the PET. Thorough experiments were performed with different numbers of subjects to find the effect of data size for both MRIs and PETs; hence, we did not use the same number of patients.

A. PERFORMANCE ANALYSIS AND DISCUSSION
To study the performance of the proposed models listed in Table 5, we visualized the convolutional layers as well as the FCLs with the help of Pseudo-codes 1, 2, and 3. The convolution results have been discussed earlier; here we discuss the FCL output. FIGURE 6 depicts the distribution of the features of the test image set, consisting of 296 scans, separated layer-wise during classification from the first convolution to the last FCL. The classification performance of the converging and diverging architectures is the best among the four selected architectures (Table 2). Even so, on the basis of the patient-level FCL visualization demonstrated in Figure 6, the features of each class begin to separate better in the diverging architecture than in the converging one. From the first FCL, FC1, to the third FCL, FC3, the t-SNE visualization shows a better separation in the second case (i.e. diverging; see FIGURE 6). Similarly, the final FCL graph, plotted as separate colored curves for each cohort domain against the raw weights of the final 100 parameters of the trained network without any projection (see FIGURE 8), shows a better demarcation between the colored curves than the 512 parameters of the U-net architecture. Afterward, we returned to the training curves of these three networks (see FIGURE 7) to finalize the best performance. It was observed that the validation loss is significantly higher than the training loss in the converging and equivalent architectures, which indicates that the network can still be optimized; this optimization was achieved with the diverging architecture and proper hyper-parameter selection.

FIGURE 5. FCL feature visualization using t-SNE 2D feature projection for the different architectures during testing. The colored dots represent single MRI scan features from the test set in the first three FCLs, namely FC1, FC2, and FC3. The features start to show a class-domain property from an FCL onward, visualized by the formation of same-colored clusters. Based on visual inspection, the diverging architecture's features are better clustered and separated than the others, Fig. 5(d)-(f), whereas there is poor separation for the U-net-based architecture, Fig. 5(j)-(l). The training environment and training material were identical in all cases; the generated models are detailed in Table 2. The X-axis and Y-axis represent the values of the 1st and 2nd dimensions of the t-SNE 2D projection, respectively.

FIGURE 6. Feature visualization using t-SNE 2D projection for the L4 divNet on 296 test images from the BASELINE_MRI data. Each colored dot represents the features of a single MRI of the indexed class, from the 1st convolution to the 4th convolution (Figs. (a) to (d)). Features of similar groups start to segregate and can be distinctly visualized from the 1st FCL (FC1, Fig. (e)) onward, until the last FCL (FC4), where only a few colored dots fall in the wrong cluster (Fig. (h), near the green CN group and a few in the blue MCI group). This overlapped region may correspond to false positive or false negative predictions on the test set. The X-axis and Y-axis represent the values of the 1st and 2nd dimensions of the t-SNE 2D projection, respectively.
Regarding hyper-parameter selection, it is important for proper and timely training and for good performance of the trained model. In our experiments, the L4 diverging architecture was the best among the selected architectures, as specified in Tables 2 and 3, and the important hyper-parameters such as the initialization, activation, and optimization algorithms were selected using Table 3.

B. GENERALIZATION AND OVERFITTING PROBLEM
If we look at the recent architectures [14], [32]-[34] and their performance results, the reported precision and accuracy rates are very high, above 90%. In MR-guided image acquisition, various technical specifications such as the acquisition instrument, spatial positioning, contrast intensity, plane orientation, registration template, correction method, and warping protocol can introduce variability into the MRIs of a suspected class [40]. Hence, a neural network trained on one 'variety' of MRI may find it ambiguous to detect an MRI of the same target class that was acquired differently; this causes generalization error in the network, which is one of the leading challenges in medical imaging diagnosis. We therefore tested our model with other data from ADNI, denoted MRI_adapted because it was partly adapted from [39], which differs in the participants under the ADNI project. The MRI_adapted dataset was used only for testing generalization; it consists of 135 AD, 162 CN, and 134 MCI 3D scans, and the testing results are presented in FIGURE 9. The other way to scrutinize could be the visualization of the features: the better the features a CNN extracts, the better it will learn. Similarly, overfitting goes hand in hand with generalization error. A non-generalized model learns 'too well', merely memorizing the training pattern, which causes overfitting. Once we solve the overfitting problem, generality is also achieved.

FIGURE 7. Training graph plotted with the training loss and validation loss on the Y-axis and the corresponding iteration number on the X-axis; more iterations correspond to longer epochs. The training plot of the converging architecture shows a validation loss much higher than the training loss, which may cause poor performance; the case is similar for the equivalent architecture. However, the validation loss is considerably reduced in the diverging architecture, making it the optimal choice. The training material and the training environment are identical in all three cases.

FIGURE 8. Final FCL weight values plotted directly on the Y-axis for the three target domains, separately for each tested architecture, using Pseudo-code 3. The X-axis extends from 0-100 for the first three graphs, whereas it extends from 0-512 in Fig. (d): the first three architectures have 100 parameters before producing the final three outputs for the softmax classifier, whereas U-net has 512 parameters.

IX. CONCLUSION
A CNN, like an ANN, is a semi-supervised learning algorithm that does not require heavy prior feature engineering; its automatic generic feature extraction property was discussed in section II. A few researchers have succeeded in developing optimization algorithms such as [32]; however, the most important contribution is the design of a better architectural unit itself [33]-[35]. Whether simple or complex, the result should be satisfactory and properly analyzed, which we believe we have done to some extent. Moreover, the prevailing techniques are mainly 2D image-based methodologies, so the 3D architecture-based concept is itself an initiative approach. This concluding section summarizes the key points that may help other researchers working with 3D CNNs on medical imaging in the same field.
• The deep learning process heavily depends on the choice of training materials. Closely related (training) images can enhance training performance, yet they can simultaneously 'spoil' the model through overfitting. 'Good data' rather than 'big data' is required to generate a good network; hence our generality test used an entirely different dataset that was not involved in training and was acquired from another ADNI project [39].
• Although our trained CNN is not deep enough to prototype the human brain structure, unlike reconstruction and segmentation models, it is good enough to classify the MRIs based on the segregated features learned in the convolutional layers, which is the actual aim of our study.
• MRI can be a better choice than PET for image-based CNN models. This may be due to the more diverse pixel values of the MRI.
• The selection of hyper-parameters such as the initial learning rate, the learn-rate drop factor, the activation function, and the initialization algorithm can affect the training process, but it has little effect on performance once convergence is achieved.
• The architecture and depth affect the performance of the model; thus, it is very important to have a model that is both generalized and optimized. Regarding feature selection, we are convinced that a diverging window or reception area in each layer is more beneficial than the commonly used converging or equivalent reception areas.
• 'Overfitting' and 'generalization' problems are the biggest challenges for deep learning models.
• Since we have proposed an optimized DL-based CNN for the classification of AD, MCI, and NC using MRI/PET, it can assist medical clinicians as an initial rapid test to identify a patient's condition using brain image scans alone. Moreover, since MCI is an early stage of dementia, MCI identification will also help in the early prognosis of AD.
Based on our findings, we hope this work can help researchers in the same field of MRI/PET classification in many ways. Our study is limited to the ADNI dataset and may not yet act as a universal CAD for AD detection; more avenues remain to be explored. Constantly developing deep learning methods can make this process more optimal, robust, rapid, and automatic, with a minimum level of human supervision.

APPENDIX
Appendix 1 lists the files for the MRI and PET types in Table 1, along with demographic details in Table 2. A high-quality Visio image of FIGURE 2 is also presented.