Exploring the Structural and Strategic Bases of Autism Spectrum Disorders With Deep Learning

Deep learning models are applied in clinical research in order to diagnose disease. However, diagnosing autism spectrum disorders (ASD) remains challenging due to its complex psychiatric symptoms as well as a generally insufficient amount of neurobiological evidence. We investigated the structural and strategic bases of ASD using 14 different types of models, including convolutional and recurrent neural networks. Using an open source autism dataset consisting of more than 1000 MRI scan images and a high-resolution structural MRI dataset, we demonstrated how deep neural networks could be used as tools for diagnosing and analyzing psychiatric disorders. We trained 3D convolutional neural networks to visualize combinations of brain regions, thus representing the most referred-to regions used by the model whilst classifying the images. We also implemented recurrent neural networks to classify the sequence of brain regions efficiently. We found emphatic structural and strategic evidence on which the model heavily relies during the classification process. For instance, we observed that the structural and strategic evidence tends to be associated with subcortical structures, including the basal ganglia (BG). Our work identifies the distinct brain structures that characterize a complex psychiatric disorder while streamlining the deductive reasoning that clinicians can use to ensure an economical and time-efficient diagnosis process.


I. INTRODUCTION
Autism spectrum disorders (ASD) is a term embodying neurodevelopmental disorders characterized by persistent insufficiencies in social communication as well as restricted and repetitive behaviors, interests, or activities [1]. According to a report from the Centers for Disease Control and Prevention (CDC) in 2018 [2], one out of 59 children in the United States has ASD symptoms. In the Republic of Korea, the prevalence of ASD is estimated to be 2.64% among school-age children [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Shadi Alawneh.
Studies using neuroimaging techniques, such as magnetic resonance imaging (MRI) or positron emission tomography (PET), have provided many insights into the neurodevelopmental characteristics underlying ASD [4]-[8]. Most findings from these imaging studies are based on a univariate analytical approach assuming the independence of each voxel [9], [10]. In contrast to mass-univariate methods, machine learning models can use multiple voxels as inputs, making it possible to study high-level relationships between different features. These models are capable of identifying the differences between a disease and control group, while suggesting a suitable diagnosis strategy for each subject [11]. Machine learning models have been successful in solving various disease classification problems in ailments including Alzheimer's disease [12]-[15], schizophrenia [16], [17], attention deficit hyperactivity disorder [18], [19], and other psychiatric diseases [20], [21].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. RELATED WORK
Rapid advances in deep learning have allowed the integration of various data, including data with different modalities [22]- [27]. Several studies have demonstrated the utility of deep learning in medical problems [28]. For example, a fusion of latent feature representations extracted from MRI and PET data has been used in diagnosing Alzheimer's disease [29], [30]. Deep learning has performed well in learning complex patterns, such as functional connectivity, making it potentially helpful for diagnosis purposes [31]. ASD is characterized by persistent deficits in social communication and interaction as well as restricted and repetitive patterns of behavior, interests, or activities. The causes of ASD are still unknown, but some researchers hypothesize that the structure of the brain contains relevant information [32], [33]. The data consist of volumetric measures and the structure of the cerebellar vermis [34], regional thicknesses extracted from the surface-based morphometry [35], the volumes of gray and white matter maps [36], [37], volumetric and geometric features extracted from selected cortical locations, and morphometric features of selected regions of interest [38]. A few studies have reported a relatively high accuracy, between 76% and 90%. However, these studies involved performance measurement of classifiers conducted on small datasets, usually consisting of less than 50 participants [39]- [41]. Moreover, the body of research has yet to produce robust algorithms or out-of-sample performance.
When these tests were implemented on a large-scale dataset collected from different populations and places, their performance significantly decreased. One study used MRI samples from the Autism Brain Imaging Data Exchange (ABIDE) to define the histogram of oriented gradients, obtaining an accuracy of 60.1% [42]. In another study, two different types of neural networks were used to process MRI data. This study achieved an accuracy of 61.7%. The models reportedly performed better on relatively large-scale MRI datasets [43]. Weights from the convolution neural networks (CNN) were replaced with weights from the pre-trained sparse autoencoder network. In addition to inadequate classification performance, the models carry poor transparency. In other words, the factors affecting the model's decisions remain ambiguous whilst classifying each subject. Such factors can be used as indices measuring model suitability. After preprocessing 1113 sMRI samples from the Human Connectome Project (HCP) data set (http://www.humanconnectome.org/ for technical information) and screening out sMRI data with similar age and gender ratios in the ABIDE, we used an encoder to classify subjects with autism [44], [45]. This model can predict the neuroanatomical deviations associated with autism compared to a control group [46].
In addition to sMRI-based classification, other studies have used fMRI data. Based on Pearson's correlation coefficient, 19900 region of interest (ROI) features were selected from the CC200 functional parcellation atlas of the brain [47], and an autoencoder was used to classify autism with an accuracy of 0.743. Similarly, by using the parcellation atlas [48], the temporal features of each ROI in the rs-fMRI data were calculated and fed into a 1D convolutional neural network, leading to a classification accuracy of 81% for the ABIDE-ETH1 dataset [49]. Another study used a cross-validated grid search to compare multiple classification models, such as support vector machines and logistic and ridge regression. The classification accuracy was 71.98%. The researchers further analyzed seven different brain atlases, including CC400, to identify autism-correlated and anti-correlated regions of interest in the brain [50].
Model comprehensibility is particularly crucial in diagnosing psychiatric diseases, especially when the causes of the disease are not fully known. Finding solutions to these fundamental issues is a necessity for enhancing both the reliability of classification performance and interpretability of the model's decision.

III. EXPLORING THE STRUCTURAL AND STRATEGIC BASES OF AUTISM SPECTRUM DISORDERS
To resolve these issues, we conducted large-scale simulations comparing the classification performance of five different categories of deep learning models, including convolutional neural networks, recurrent neural networks, and spatial transformation networks. We were able to visualize the results of each model. The simulations were carried out on two different neuroimaging datasets: one from the Child Psychiatric Clinic at Severance Hospital, Yonsei University College of Medicine (YUM), which provides high-resolution structural MRI, and the other from the international Autism Brain Imaging Data Exchange (ABIDE).
First, we carried out an extensive model comparison for reliable performance evaluation between a number of classifiers using various network architectures. Second, we explored the structural bases of ASD by visualizing a combination of brain regions, which can be considered the bases of the model's classification decision. Further, we included invariant classifiers in our study to effectively deal with variations in size and translation. Our findings suggest the possibility that ASD patients have distinctive structural signatures in their brains. Last but not least, we used attention-based recurrent neural networks to learn a sequence of the brain regions, leading to classification. This sequence provides a better understanding of the background strategies used by the models while classifying the data. Revealing such strategies pointed to regions of the brain for assessment when making a diagnosis. These strategies can make the diagnosis process more economical and time-efficient by providing a useful order of the regions associated with diseases. We observed compelling brain regions for the model's classification, particularly multiple subcortical structures, including the basal ganglia. Overall, these results provide both structural and strategic information for characterizing ASD, as shown in Fig. 1.

A. DATA AND PRE-PROCESSING
In our study, we used two MRI datasets for autism classification research, the first collected by the Yonsei University College of Medicine (YUM). The second dataset was obtained from the Autism Brain Imaging Data Exchange (ABIDE) website, which houses a large number of open-source MRIs for autism research [51].
For the YUM dataset, according to the sample image quality, we selected 73 out of 84 samples, including 40 people with high SCQ points and 33 people with low SCQ points. All subjects gave informed consent, and the Institutional Review Board of the Severance Hospital of Yonsei University approved the study for research with human subjects. We performed this study at the Yonsei University College of Medicine. In addition, we confirm that all methods were performed in accordance with the relevant guidelines and regulations.
For the ABIDE dataset, after combining the ABIDE I and ABIDE II databases and screening the MRI data for suboptimal quality, there were 1,992 people in total, with 946 autism patients and 1,046 people as controls.
The ABIDE dataset is a combination of sets of MRI scans taken independently by more than 24 organizations, leading to inconsistency in MRI quality and dimensions. As a result, the dataset required cautious preprocessing.
For the YUM dataset, the processing pipeline consists of three steps: The pre-processing method employed for the ABIDE dataset differs from the YUM dataset. Because of the dissimilar configuration and quality of each dataset, we employed Statistical Parametric Mapping software (SPM8) to perform the registration [52]. The ABIDE pre-processing pipeline consisted of two steps: (A) non-linear spatial transformation of the MRI to the Montreal Neurological Institute (MNI) T1 template [53]; and (B) normalization of the voxel value to a range of [0,1]. In step (A), we used the default setting of the bounding box, which was [-78, -112, -50] to [78,76,85], and the voxel size, which is 2 mm × 2 mm × 2 mm, in the SPM8. The size of the MRI after registration became 79 × 95 × 79.
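Step (B) of the ABIDE pipeline, scaling voxel values to [0,1], can be illustrated with a short sketch. The text only states the target range, so the per-volume min-max rule below is an assumption, as are the function name and the random toy volume; it is not the authors' code.

```python
import numpy as np


def normalize_voxels(volume: np.ndarray) -> np.ndarray:
    """Min-max scale voxel intensities of one volume to the range [0, 1].

    Assumption: scaling is done per volume with its own min and max.
    """
    volume = volume.astype(np.float64)
    lo, hi = volume.min(), volume.max()
    if hi == lo:
        # Constant volume: return zeros to avoid dividing by zero.
        return np.zeros_like(volume)
    return (volume - lo) / (hi - lo)


# Toy volume with the post-registration shape reported in the text.
rng = np.random.default_rng(0)
mri = rng.integers(0, 4096, size=(79, 95, 79))
scaled = normalize_voxels(mri)
```

A dataset-wide or percentile-based scaling would be an equally plausible reading of the text; the per-volume variant is shown only because it needs no further assumptions about the data.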

B. MODELS
In this paper, we used five main model configurations for classifying and visualizing the samples, as shown in Table 1. Some of them have several model subtypes. For example, we can use the 2D CNN or 3D CNN to process 2D MRI or 3D MRI input.
We illustrate the architecture of model type 2 in Fig. 2A. These models combine an STN with a traditional CNN so that the network can look at a specific part of the MRI. There are also four model subtypes, such as 2D input+3D CNN+2D STN. Because we had to deal with 3D input data, we modified the original STN model [54] into a 3D version, the so-called 3D STN. That is, the 3D STN receives a three-dimensional input, and the spatial transformation matrix τ becomes a 4 × 4 matrix, as follows:

$$\tau = \begin{bmatrix} s_x & 0 & 0 & t_x \\ 0 & s_y & 0 & t_y \\ 0 & 0 & s_z & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{1}$$

where $s_x$, $s_y$, $s_z$ are the scale factors for each dimension and $t_x$, $t_y$, $t_z$ are the translations for each dimension.
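The 4 × 4 scale-and-translation matrix described above can be sketched in plain Python. This is an illustrative sketch under assumptions: the function names are hypothetical, and the coordinate convention (output-grid coordinates mapped to input sampling coordinates, with the origin at the volume center) follows the usual spatial transformer formulation rather than anything stated in the text.

```python
def affine_matrix(scale, translation):
    """Build a 4 x 4 homogeneous matrix from per-axis scales and translations."""
    sx, sy, sz = scale
    tx, ty, tz = translation
    return [
        [sx, 0.0, 0.0, tx],
        [0.0, sy, 0.0, ty],
        [0.0, 0.0, sz, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(matrix, point):
    """Apply the matrix to a 3D point written in homogeneous coordinates."""
    x, y, z = point
    homogeneous = (x, y, z, 1.0)
    out = [sum(m * v for m, v in zip(row, homogeneous)) for row in matrix]
    return tuple(out[:3])


# With scale 0.5 and no translation, the corner of the output grid samples
# the input halfway to its corner: a central crop that is then enlarged.
crop = affine_matrix((0.5, 0.5, 0.5), (0.0, 0.0, 0.0))
corner = transform_point(crop, (1.0, 1.0, 1.0))
```

Under this convention, scale factors below 1 correspond to zooming in on the center of the volume, and the translations shift the sampled window.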
Figure 2 caption (partial). (B) For each time step in the RNN, the previous hidden state h2 in the second layer becomes the input of the STN, which outputs the spatial transformation matrix. The STN then uses this matrix to spatially transform the original 3D MRI image [54]. (C) In model subtype 4-1, each slice of the original MRI goes through the same, independent processing pipeline. The slice is fed into a 2D CNN and becomes a 3D feature map. We then apply the 2D GAP to each slice of this 3D feature map and fully connect the result to a single unit. All these single units, one per slice, are then fully connected to the last layer for classification. (D) In model subtype 4-2, the MRI is fed into the 3D CNN, and we use the 3D GAP to process each cube after the convolution. (E) Model type 5. For each time step of the RNN, the previous hidden state is fed into an FC layer, called the location network, to output the attention location. The glimpse network uses this location to extract the cubic patch. We then use FC layers to process the location and the cubic patch to obtain the location and glimpse features, respectively, and combine them.

We have depicted the architecture of model type 3 in Fig. 2B. There are two subtypes: 3D input+2D CNN+3D STN+RNN (3-1) and 3D input+3D CNN+3D STN+RNN. The architecture of model type 4 is shown in Fig. 2C and Fig. 2D. There are two subtypes: 2D input+2D CNN+CAM (4-1) and 3D input+3D CNN+CAM (4-2). The core idea of CAM is to use the global average pooling (GAP) layer, $F_k = \sum_{x,y} f_k(x, y)$, to calculate the importance of each slice of the feature map from the last convolution layer, and then to create a heat map for a given image using

$$M_c(x, y) = \sum_k w_k^c f_k(x, y) \tag{2}$$

where c stands for the class, $f_k(x, y)$ is the activation value of the k-th unit in the last convolution layer at the point (x, y), and $w_k^c$ stands for the weight of the FC layer that connects the k-th unit in the GAP layer with the output unit for class c [56]. Because we used (2) for 2D images, we refer to the pooling in (2) as the 2D GAP. For 2D input+2D CNN+CAM (4-1), we apply the 2D GAP layer to the feature map of the last convolution layer for each MRI slice, $F_i^k = \sum_{x,y} f_i^k(x, y)$, and (2) takes a per-slice form (3), in which i is the i-th MRI slice, and $w_k^c$ and $w_{ik}^c$ stand for the weights in the last two FC layers. The remaining symbols are the same as in (2). For 3D input+3D CNN+CAM (4-2), only the dimension of the feature map after convolution changes, to four dimensions, so (2) becomes

$$M_c(x, y, z) = \sum_k w_k^c f_k(x, y, z) \tag{4}$$

where $f_k(x, y, z)$ is the feature value of the k-th unit in the last convolution layer at the point (x, y, z).
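The weighted-sum construction behind the 3D class activation map can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names, array shapes, and random toy data are assumptions, the bias term is omitted, and GAP is taken here as a spatial mean, so the spatial mean of the map equals the class score.

```python
import numpy as np


def class_activation_map(features: np.ndarray, fc_weights: np.ndarray, cls: int) -> np.ndarray:
    """Weighted sum of the K last-layer feature maps for one class.

    features:   (K, X, Y, Z) activations of the last convolution layer
    fc_weights: (C, K) weights of the FC layer that follows the GAP layer
    """
    return np.tensordot(fc_weights[cls], features, axes=(0, 0))


def class_scores(features: np.ndarray, fc_weights: np.ndarray) -> np.ndarray:
    """Class scores of the GAP + FC head (bias omitted)."""
    pooled = features.mean(axis=(1, 2, 3))  # global average pooling
    return fc_weights @ pooled


# Toy example: 8 feature maps of size 5 x 6 x 7 and 2 classes.
rng = np.random.default_rng(1)
toy_features = rng.random((8, 5, 6, 7))
toy_weights = rng.random((2, 8))
heat_map = class_activation_map(toy_features, toy_weights, cls=1)
scores = class_scores(toy_features, toy_weights)
```

Because the map and the class score share the same weights, voxels with large values in the map are exactly those that push the score for that class upward, which is what makes the map usable as a heat map.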
For model type 5, we show the architecture in Fig. 2E. We change the traditional recurrent attention model from 2-dimensional image input to 3-dimensional MRI input [57]. We set the center of the MRI image to be the starting point of the RAM model. Initially, RAM used the last hidden state of the RNN for classification and did not have the location constraint in the cost function. From the experimental results, the location network inside the RAM always outputs the coordinates near the corner of the MRI, which means it converges to the local minima quickly. Thus, we added a constraint function (5) into the cost function in order to assist the RAM in learning more useful information and reaching the global optimum.
where C is a constant value. The location (x, y, z) in the image is normalized to the range [0,1], with (0,0,0) being the top left corner of the image and (1,1,1) being the bottom right corner. The constraint forces the RAM to focus on the central part of the brain; whenever the attended location leaves that region, the model is penalized by the constant value C.
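A constant penalty of this kind can be sketched as follows. The exact form of the constraint is not given in the text, so everything here is an assumption made for illustration: the function name, the bounds 0.25 and 0.75 that define the "central part", and the value of C.

```python
def location_penalty(location, lower=0.25, upper=0.75, c=1.0):
    """Return 0 if the normalized location lies in the central box, else C.

    Hypothetical values: `lower`, `upper`, and `c` are illustrative and do
    not come from the paper.
    """
    inside = all(lower <= coord <= upper for coord in location)
    return 0.0 if inside else c


center_cost = location_penalty((0.5, 0.5, 0.5))     # attends the middle
corner_cost = location_penalty((0.02, 0.5, 0.97))   # drifts toward a corner
```

Added to the classification cost, such a term makes corner locations strictly worse than central ones, which is the behavior the text describes.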

C. IMPLEMENTATION DETAILS
We split the original database into training and testing data, at 80% and 20%, respectively. We kept the percentage of patients with autism in the training data the same as in the original database. For each type of model, we used 10-fold cross-validation.
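Keeping the class proportion fixed across the split is commonly called a stratified split. The sketch below shows one way to do it with the standard library only; the function name and seed are assumptions, and the label counts reuse the ABIDE figures reported earlier purely as toy input.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 946 autism (1) and 1,046 control (0) samples.
labels = [1] * 946 + [0] * 1046
train_idx, test_idx = stratified_split(labels)
train_autism_rate = sum(labels[i] for i in train_idx) / len(train_idx)
```

Splitting each class separately is what guarantees that the autism rate in the training set stays close to the rate in the full database.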
For the hardware configuration, we primarily used an Intel Core i7-6700 CPU @ 3.40GHz × 8 and a TITAN Xp/PCIe/SSE2 graphics processing unit.
We used the network architectures shown in Table 2. 2DCNN(f_h/f_w, ks, s) denotes a 2-dimensional convolution layer with f_h filters for the YUM dataset and f_w filters for the ABIDE dataset, where ks is the kernel size and s is the stride. If f_w is not specified, the YUM and the ABIDE datasets share the same number of filters. 3DCNN(f, ks, s) follows a similar definition.
2DMP(ps, s) denotes a 2-dimensional max-pooling layer with pool size ps and stride s. 3DMP(ps, s) follows a similar definition.
BATCH() denotes batch normalization, while DROP(p) denotes a dropout layer with probability p. FC(k) denotes a fully connected layer with k output units. RNN(k) denotes a recurrent neural network with k output units. 2DGAP() denotes the 2-dimensional global average pooling layer, and 3DGAP() is defined likewise for three dimensions.
For model type 4, the 2D input+2D CNN+CAM (4-1) and 3D input+3D CNN+CAM (4-2) are shown in Table 2, where the superscript indicates that these layers are used for each slice of the MRI image repeatedly and independently. We used the central part of the original images as an input for the models.
Model type 5 is rather awkward to summarize in a table, so we give the implementation details of each network as described in [10]. At each time step, the glimpse network extracts three cubic patches inside the MRI image, with the size of the first cubic patch being 4 × 4 × 4, and each successive patch having twice the width, height, and depth of the previous one. After extracting the three cubic patches and resizing them to the same size, we flattened them and fed them into a fully connected layer with 128 units. The location network feeds the location coordinate into a fully connected layer with 128 units. We then concatenated the glimpse feature from the glimpse network and the location feature from the location network into a combined feature and fed it into the RNN with 256 units. After eight time steps, or glimpses, the hidden states of the RNN were used for classification. For 3D input+RAM+loc (5-1), the location constraint cost function is adopted inside the model. For 3D input+RAM+noloc (5-2), no location constraint cost function is used inside the model. For 3D input+RAM+loc+fc (5-3), the location constraint cost function is used inside the model, and the model uses the information of all the hidden states for classification, while omitting the others. We set the center of the MRI image to be the starting location of the RAM model (5-1) to . For 3D input+RAM+rand (5-4), the location network is replaced with a Gaussian distribution centered at 0 with a standard deviation of 0.6. We found that, even with the RAM+noloc model, we could still reach an accuracy comparable to that of RAM+loc, implying that the attention regions after the first time step of RAM are meaningless.
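The multi-scale glimpse described above (three cubic patches of side 4, 8, and 16, brought to a common size and flattened) can be sketched with NumPy. Several details are assumptions made for this sketch and are not taken from the paper: the function name, zero padding at the borders, block averaging as the resizing method, and a center given in voxel indices rather than normalized coordinates.

```python
import numpy as np


def extract_glimpse(volume, center, base_size=4, n_scales=3):
    """Return n_scales cubic patches around `center`, resized and flattened.

    Patch sides are base_size, 2 * base_size, 4 * base_size, ...; each patch
    is block-averaged down to base_size^3 before flattening.
    """
    patches = []
    for scale in range(n_scales):
        size = base_size * 2 ** scale
        half = size // 2
        # Zero-pad so that patches near the border stay inside the array.
        padded = np.pad(volume, half, mode="constant")
        x, y, z = (int(c) + half for c in center)
        patch = padded[x - half:x + half, y - half:y + half, z - half:z + half]
        # Downsample to base_size per axis by averaging factor^3 blocks.
        factor = size // base_size
        patch = patch.reshape(
            base_size, factor, base_size, factor, base_size, factor
        ).mean(axis=(1, 3, 5))
        patches.append(patch.ravel())
    return np.concatenate(patches)


# Toy volume of ones with the registered ABIDE shape; glimpse at the center.
toy_volume = np.ones((79, 95, 79))
glimpse = extract_glimpse(toy_volume, center=(39, 47, 39))
```

The resulting vector has 3 × 4 × 4 × 4 = 192 entries, which would then be the input of the 128-unit fully connected layer of the glimpse network.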

D. DATA AVAILABILITY
The ABIDE dataset analyzed during the current study is publicly available at http://fcon_1000.projects.nitrc.org/indi/abide/abide_II.html. Moreover, the YUM dataset that supports the findings of this study is available from Severance Children's Hospital, the Institute of Behavioral Science in Medicine, Yonsei University College of Medicine. However, restrictions apply to the availability of these data, which were used under license for the current study and hence are not publicly available.

V. RESULTS
Table caption. The crossed-out cells refer to simulation conditions that cannot be run on a standard GPU server due to tremendously high computation costs. 3D and 2D input: a whole MRI image and a single slice of the given MRI image were given as input to the classifier, respectively. CNN: a convolutional neural network, STN: a spatial transformer network, RNN: a recurrent neural network, CAM: a class activation mapping, RAM: a recurrent attention model, loc: a local constraint where an input space was confined to the brain area for the sake of efficiency of learning, noloc: a local constraint was not applied, fc: a fully connected network, rand: random location.
Figure caption (partial). ... Table 4. For example, RAM+fc refers to the model 5-3 in Table 4. AE+CNN refers to the auto-encoder+CNN model used in [42], and the 50% horizontal line refers to a chance level. The mean test accuracy was recorded every 3000 and 5000 training steps. The shaded area represents a 95% confidence interval.
Table 4 shows the details of each combination and the corresponding test accuracy during 10-fold cross-validation (CV). The first four categories are based on an invariant method (CNN) combined with various feature visualization techniques (STN and CAM), whereas the fifth type is based on a sequence learning model (RAM).
The YUM sample consists of 84 subjects (3yr-11yr) with MRI and Social Communication Questionnaire (SCQ) data (see Table 5). Two pediatric psychiatrists at Yonsei University Severance Hospital diagnosed the children as ASD based on DSM-V (see Methods for complete details).
We divided the data into two groups: low and high SCQ, with an SCQ score of 15 set as the threshold (Table 5). The ABIDE dataset is an open-source MRI data repository for autism research (see Methods for more details). The classification accuracy as a function of training epochs is shown in Fig. 3.
We found that the 2D/3D CNN and the RAM performed the best for the YUM dataset, whereas a simple 3D CNN performed the best for the ABIDE dataset (see Fig. 3 and Table 4). Note that the 3D CNN model outperforms the model reported in the previous study [42].
Figure caption. Visualization of the feature learned by class activation mapping. The heat maps generated by the CAM and the corresponding local maxima (red dots) for model 4-1 are superimposed on an input brain structure image. Note that to improve computational efficiency and preclude the adverse boundary effect of the model's convolution kernels on CAM results, the results were confined to the region where the brain images are located. To ensure the reliability of the simulation, we acquired the CAM results by running ten cross-validation experiments. For (B), the local maxima are discovered within ten voxels. Refer to Table 3 for the full list of highlighted regions and corresponding MNI coordinates.
Figure caption (partial). ... Table 4). The test accuracy was measured over 10-fold cross-validation. The average test accuracy, indicated by colored dots, was recorded every 3k and 5k training steps for the YUM and the ABIDE, respectively. The training continued until reaching a maximum of 100k steps. The shaded area represents the 95% confidence interval.

B. TRAINING VARIOUS TYPES OF NEURAL NETWORKS FOR ASD CLASSIFICATION
In order to examine which set of input features contributed significantly to the models while categorizing the subjects, we implemented two types of models, each with different characteristics. The first approach was to optimize a linear transformation of input images for classification. On the YUM dataset, we trained the STN, a neural network capable of learning an optimal affine transformation of the input image for use in the classification task (refer to model types 2 and 3 in Table 4). The trained STN showed that the optimal input transformation involves cropping the central part of the original 3-dimensional MRI images and then enlarging it to the size of the original image (Fig. 4). This finding suggests that the subcortical structure of the brain might be influential in classification. For the rest of the cases in types 2 and 3, the STN did not learn any meaningful input transformation (data not shown).
The second approach involves training an invariant classifier, such as a convolutional neural network (CNN), and then visualizing the input features that contribute to the model's classification. We adopted the class activation mapping (CAM) algorithm, which distinguishes a group of informative features from others in the given input. This algorithm estimates the degree of each feature's contribution to the classification (refer to model type 4 in Table 4). In our work, we implemented the CAM to create a heat map representing the extent to which the corresponding pixel value contributes to the CNN's classification. We stacked an input image for which the model makes an accurate prediction and its corresponding heat map to visually highlight a particular region of the image that contributes significantly to the model's classification. True positive data are explicitly selected as inputs for the CAM. The heat maps are generated by combining the CAM results of every sample for model (4-1) (Fig. 5). Interestingly, local maxima were found in subcortical areas, including the head and the tail of the caudate nucleus (slice #78). A few local maxima were also found in the cortical area, including the insular and inferior frontal gyrus (slice #74). Another interesting observation is that the local maxima also include brain structures with heavy connections, such as the claustrum, which connects subcortical to cortical areas (slice #74), and the corpus callosum, which connects the two hemispheres (slice #104). To prevent the boundary effect of the model's convolution kernels from distorting the CAM results, we excluded the top and bottom eight slices from analysis. Note that most of these brain regions are implicated in decision making, learning, and inhibitory control. One interesting possibility is that these structural differences can contribute to atypical behavior in people with autism spectrum disorders.
Note that unlike model (4-1), model (4-2) seems to suffer from an overfitting issue. This issue resulted in less reliable CAM results, which do not warrant discussion. We were not able to apply the CAM to the ABIDE dataset because the classification accuracy did not exceed the chance level, which impairs the visualization.
Figure caption (partial). We used a binarized mask extracted from a probabilistic subcortical nuclei mask with the threshold probability 0.5 [58]. The information of attention boxes was extracted from the recurrent attention model trained on the YUM. The blue and red bars refer to the low (LSCQ) and the high SCQ group (HSCQ), respectively. The yellow bar refers to the case with random sampling. The error bars represent the 95% confidence interval. The asterisk indicates statistical significance (p<0.05; paired t-test between LSCQ/HSCQ and Rand).

C. UNDERSTANDING THE STRATEGY BEHIND THE DECISION MAKING
The models that belong to the first four types (types 1 to 4) adhere to single-shot classification, directly predicting the class label for the entire input image. Although the CAM has remarkable ability in visualizing a correlative basis, it lacks the capability to describe causalities between the features of the input image data. In order to discover the optimal strategies to use for accurate classification, we used a recurrent attention model that learns a sequence of voxels (partial brain regions) that the model needs to consider during classification. An optimized sequence can be considered as a set of aptly ordered readouts of brain structures, which ultimately serve as an effective guide for classifying the data. This approach corresponds to the models belonging to type 5.
All of the type-5 models rapidly identified the optimal input sequences for classification and exceeded 70% accuracy within the first 150K training steps (Fig. 6). For both datasets, a successfully trained model shows a relatively stronger tendency to identify the subcortical structure, including BG (Fig. 7). To formally quantify this effect, we computed the ratio of overlap between the model's attention boxes and basal ganglia (BG) (Fig. 8).
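One way to quantify the overlap between attention boxes and a binary region mask is sketched below. The precise definition of the ratio is not spelled out in the text, so the choice made here (attended voxels inside the mask divided by all attended voxels) is an assumption, as are the function name, the box format, and the toy probability map.

```python
import numpy as np


def overlap_ratio(mask: np.ndarray, boxes) -> float:
    """Fraction of attended voxels that fall inside a binary mask.

    boxes: iterable of ((x, y, z), size) cubic boxes given by their start
    corner in voxel indices and their side length.
    """
    attended = np.zeros(mask.shape, dtype=bool)
    for (x, y, z), size in boxes:
        attended[x:x + size, y:y + size, z:z + size] = True
    n_attended = attended.sum()
    if n_attended == 0:
        return 0.0
    return float((attended & mask).sum() / n_attended)


# Toy probabilistic map binarized at 0.5, then one 4 x 4 x 4 attention box.
prob_map = np.zeros((10, 10, 10))
prob_map[2:6, 2:6, 2:6] = 0.9
bg_mask = prob_map >= 0.5
ratio = overlap_ratio(bg_mask, [((4, 4, 4), 4)])
```

In the toy case the box shares a 2 × 2 × 2 corner with the mask, so 8 of its 64 voxels overlap. Comparing this ratio against boxes drawn at random locations gives a baseline of the kind the random-sampling condition provides.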

VI. DISCUSSION
We investigated how models comprised of deep neural networks can be applied to identifying individuals with a complex psychiatric disorder such as ASD. The overall architecture is summarized in Fig. 1. We primarily used the CNN and RNN as analysis and diagnosis tools, building them with various architectures. We measured the performance of every model on classification tasks, with each task using a different MRI dataset.
Unlike conventional approaches that extract morphological features using traditional algorithms, we directly fitted neural networks to the original MRI voxel data, finding the structural difference between the autism and control groups. Our end-to-end training regime does not require extraction of human morphological feature information, reducing the risk of missing information and causing errors in the extraction process.
Note that this paper aims not only to reliably enhance classification accuracy, but also and more importantly, to explore structural and strategic ASD evidence. We achieve this goal by using a relatively large sample size and by exploring a variety of different model versions, including 2D/3D CNN, STN, and RAM. For example, RAM provides the logic of classification (Fig. 6); however, the ABIDE dataset's test accuracy is slightly lower than the best version. There are several reasons why it is challenging for YUM and ABIDE to achieve consistent accuracy:
• Data variability: ABIDE is a collection of data from more than 20 institutions, each with different scanners, scanning protocols, and configuration parameters, making image features very different from those included in the YUM data. Transferring ABIDE data to the MNI152 standard template unavoidably caused image variability. On the other hand, the YUM data set had relatively smaller variability because it was collected by the same facility. This fact might explain why the RAM showed strong performance for the YUM in comparison to the versions based on the invariant method, such as 2D or 3D CNN.
• Sample size: The sample size of ABIDE involves more than 1000 images, whereas the YUM contains only 84. It is generally known that CNN models show reliable performance when the sample size is sufficiently large (ABIDE). However, attention-based models, such as RAM, hold an advantage when the sample size is very small (YUM).
• Structural heterogeneity: ABIDE includes a very broad age range for patients with autism, implying substantially higher heterogeneity than YUM (refer to both Table 5 and Table 6).
• Spatial resolution: The YUM consists of high resolution sMRI. The spatial resolution of YUM is higher than that of ABIDE.
• Class labeling: The method for labeling the ABIDE data differs slightly from that of the YUM data, which relies on the SCQ (Social Communication Questionnaire) index.
We built the CAM and a diagnosis sequence generator on top of the CNNs and the RNNs, respectively. The CAM quantifies the degree of contribution of each input. In other words, the algorithm computes a value that represents how often and how strongly the model refers to a particular feature during the classification tasks. Psychiatric physicians can use this type of analysis tool to identify significant brain regions during the diagnosis process. We also ran both the grad-CAM and the guided grad-CAM on our dataset. Despite much effort to fine-tune these models, their visualization results are slightly noisier and less reliable than those obtained with CAM. The inputs of the 3D CAM and the 2D CAM differ in structure, being a 3D volume and a 2D slice, respectively. This difference also explains why the 3D and 2D CAM offer different results in some areas. That being said, based on the overall statistical analyses, we found that the results from these two models consistently overlapped in the thalamus, caudate nucleus, claustrum, and other subcortical tissue areas. Further, applying the RNN generates an optimized sequence of the brain regions, which can serve as a remarkable index for clinicians. The generator provides rigorously ordered brain regions to aid in diagnosis. Such structural and strategic clinical models may be state-of-the-art indicators of ASD. Using these models in clinical settings may positively impact individual patients while increasing efficiency and economic benefits for the community at large.
The major regions contributing to the classification were subcortical structures, including the BG. The BG, which itself consists of the striatum, caudate nucleus, globus pallidus, and putamen, is a group of subcortical structures involved in motor function as well as learning and memory. The BG is suspected to contribute to repetitive and stereotyped behaviors, which are a core symptom domain of autism spectrum disorder. Although the BG has been implicated in autistic symptoms, there is little evidence from previous high-resolution MRI (≥3T) studies. Our results (Fig. 8) strongly support the idea that the BG area could be a potential biomarker of autism.
To the best of our knowledge, Ghiassian and Sen's papers are the only two demonstrations using automated learning methods to classify the autism patient using extensive databases. There are a few differences between our model and the models used in previous studies. Firstly, Ghiassian's study relies on a hand-crafted histogram of oriented gradients, which may be prone to subjective bias. In contrast, we employed an end-to-end training regime for classification. Secondly, unlike Sen's study, our study adopted auto-encoders for data reconstruction. We were able to avoid weights transfer, which usually is used in the classification task. Thus, the filter number does not necessarily match the number of units in the hidden layer of the sparse autoencoder. Third, we used a 3D-CNN that learns the complex spatial patterns of features. This setting reflects our perspective on a volume or thickness of gray and white matter such that they can be good indicators of ASD. Note that our model outperforms the 2D-CNN by 2.8% in overall accuracy.
The reported classification accuracy may be considered inadequate to reach the level for clinical utility. Despite this technical insufficiency, our study provides a useful protocol for visualizing elements with neural networks learning from the data, as well as perceiving their relationships. These findings will allow profound clinical insights into ASD diagnosis. Our study blazes a trail in discovering structural and strategic evidence for acknowledging complex psychiatric symptoms, thereby guiding clinicians in refining currently-available diagnostic tools.
Kang drafted the manuscript, and S.W. Lee and K.A. Cheon commented and revised it. All authors approved the final version of the manuscript for submission.
The source code in this paper can be found in https://github.com/brain-machineintelligence/Autism-Classification.