Deep Spatio-Temporal Representation and Ensemble Classification for Attention Deficit/Hyperactivity Disorder

Attention deficit/hyperactivity disorder (ADHD) is a complex, widespread and heterogeneous neurodevelopmental disorder. The traditional diagnosis of ADHD relies on long-term analysis by professional doctors of complex information such as clinical data (electroencephalogram, etc.), patients' behavior and psychological tests. In recent years, functional magnetic resonance imaging (fMRI) has developed rapidly and is widely employed in the study of brain cognition because it is non-invasive and radiation-free. We propose an algorithm based on a convolutional denoising autoencoder (CDAE) and adaptive boosting decision trees (AdaDT) to improve the results of ADHD classification. Firstly, combining the advantages of convolutional neural networks (CNNs) and the denoising autoencoder (DAE), we developed a convolutional denoising autoencoder to extract the spatial features of each fMRI frame and arrange them in time order. Then, AdaDT was exploited to classify the features extracted by CDAE. Finally, we validate the algorithm on the ADHD-200 test dataset. The experimental results show that our method offers improved classification compared with state-of-the-art methods in terms of the average accuracy of each individual site and of all sites. Meanwhile, our algorithm maintains a balance between specificity and sensitivity.

I. INTRODUCTION
Attention deficit/hyperactivity disorder (ADHD) is a neurodevelopmental condition characterized by core symptoms such as inattention, hyperactivity, and impulsivity [1]. ADHD is one of the most debilitating childhood illnesses. Approximately 65% of cases persist into adulthood [2] and seriously affect the study and work of patients, placing a heavy burden on families and society. In clinical practice, mental health experts often use the Diagnostic and Statistical Manual of Mental Disorders (DSM) developed by the American Psychiatric Association to help diagnose ADHD [3]. At present, ADHD is only diagnosed after clinical review by an experienced child psychiatrist, in addition to discussions with the child's parents and teachers. However, diagnoses are often inconsistent since the diagnostic process is greatly affected by subjective assessment. Therefore, it is essential to find an objective, consistent method to diagnose ADHD with the existing medical means [4].
Functional magnetic resonance imaging (fMRI) is a widely used noninvasive tool to measure brain activity. It captures the slow fluctuations of the blood-oxygen-level-dependent (BOLD) signal across brain regions during task states or resting states [5]. With the development of machine learning, scholars have paid more attention to predicting brain disorders from fMRI data, such as Alzheimer's disease (AD) [6], autism spectrum disorder (ASD) [7], and ADHD [8].
To promote research in disease imaging of ADHD, the ADHD-200 consortium held the ADHD-200 global competition in 2011, supported by the International Neuroimaging Data-sharing Initiative (INDI). The competition aimed to develop imaging-based classification methods for patients with ADHD. The ADHD-200 dataset consists of resting-state fMRI (rs-fMRI) and structural magnetic resonance imaging (sMRI) images of approximately 800 subjects, collectively provided by eight research institutions, including Kennedy Krieger Institute (KKI), New York University Medical Center (NYU), Oregon Health and Science University (OHSU), NeuroImage Sample (NeuroImage) and Peking University (Peking). Teams were evaluated on their prediction accuracy for typically developing (TD) subjects and ADHD patients (including the prediction accuracy of ADHD subcategories) and on J-statistics (derived from sensitivity and specificity). The competition also called for image-based classifiers that distinguish three types: mixed type (ADHD-I), inattentive type (ADHD-II) and TD [9]. The highest accuracy achieved using the imaging data was 60.51% in 2011.
Many researchers have exploited the data from this competition to carry out various studies on ADHD. For instance, Dai et al. used the cortical thickness (CT) and gray matter probability (GMP) extracted from sMRI, together with regional homogeneity (ReHo) and functional connectivity (FC) extracted from fMRI, as features to improve the classification accuracy of ADHD [10]. The authors not only compared the impact of each feature on classification but also fused the features through multi-kernel learning, with the classification accuracy reaching 61.5%. The same year, Sidhu et al. applied the fast Fourier transform and kernel principal component analysis to phenotypic and imaging data, which yielded an accuracy of 76.0% on two-class diagnosis [11]. In addition, Zou et al. [12] proposed a 3D convolutional neural network (CNN) deep learning classification method based on fMRI and sMRI. Firstly, ReHo, fractional amplitude of low-frequency fluctuation (fALFF), and voxel-mirrored homotopic connectivity (VMHC) were extracted manually from fMRI. Then, gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF) were extracted from sMRI. Finally, a 3D-CNN classifier was employed to evaluate the performance of each feature and the classification performance of multi-feature combinations. The study showed that the combination of fALFF and GM yields the best result, with an ADHD classification accuracy of 69.2%. Complementing this, Riaz et al. [13] created an end-to-end network for ADHD classification, which consists of a feature extraction network, a similarity network, and a classification network. The network first extracted 90 features from 90 brain regions of the preprocessed fMRI and then calculated the similarity between features. The classification accuracy of this algorithm on the Peking, NeuroImage, and NYU data reached 62.7%, 67.9%, and 73.1%, respectively. In addition, Kuang et al. [14] proposed an ADHD classification algorithm based on the fast Fourier transform and a deep belief network. The classification accuracy of this algorithm on the NYU, Peking, and KKI data reached 37.41%, 54.00%, and 71.82%, respectively. Mao et al. [15] obtained good results using the spatial information of each fMRI frame extracted by a 3D-CNN and the temporal information of the fMRI time series extracted by feature pooling and long short-term memory (LSTM) models. Finally, their proposed 4D-CNN, which extracts the spatial and temporal information of fMRI at the same time, achieved the highest accuracy of 71.3% in ADHD classification.
Recently, the popularity of deep learning methods has resulted in their extensive application to various tasks, including image denoising [16], image fusion [17], image recognition [18] and image classification [19]. As one of the most commonly used deep learning methods, CNNs can obtain the features of the input data through automatic learning, especially for high-dimensional data. However, as a supervised learning method, a CNN needs a large amount of labeled data in the training stage. Labeling is time-consuming and labor-intensive, and with few labels the network is prone to over-fitting. Therefore, an unsupervised deep learning method is selected to perform feature extraction.
The autoencoder is a practical unsupervised learning model in deep learning and consists of an encoder and a decoder. The former encodes the original representation into the hidden layer representation, while the latter decodes the hidden layer representation back into the original representation. The training objective is to minimize the reconstruction error via backpropagation. Generally speaking, the dimension of the hidden layer is lower than that of the original features [20].
Since the autoencoder is only a general framework, the encoder and decoder can be composed of a variety of deep learning models, such as fully connected layers, convolution layers, and LSTMs. CNNs have advantages in image processing due to their ability to extract the spatial information hidden in the image. It is therefore natural to expect convolutional encoder and decoder networks to work better on images than other autoencoders, which motivated the convolutional autoencoder (CAE) [21].
To solve the problem of ADHD classification based on fMRI images, the convolution denoising autoencoder is proposed as the feature extractor in the feature extraction stage. CAE has the structure of CNN and autoencoder as well as the corresponding advantages. As a simple and efficient neural network, CAE can effectively extract useful feature information from the data for classification without massive labels [22], [23].
We adopt the convolutional denoising autoencoder (CDAE) for mining spatial features to fully extract the 3D spatial information of fMRI data. In the feature extraction stage, the 3D convolutional denoising autoencoder was trained on individual fMRI frames, after which the pre-trained encoder was used to extract the spatial features of fMRI. Considering the small amount of fMRI image data in the ADHD-200 dataset, we applied principal component analysis (PCA) to the time-ordered spatial features as a further dimension reduction step to avoid the over-fitting caused by "small sample and high-dimension" data. The data after dimension reduction served as the features for ADHD classification. We employed AdaDT as the classifier, and the experimental results show that this algorithm can effectively classify ADHD in the test set. The overall flow of the proposed method is shown in Fig. 1.
The main contributions of this article are as follows:
(1) CDAE was employed to automatically extract the features of fMRI data, which can fully extract the 3D spatial information of fMRI data and avoid the unreliability and instability brought by hand-crafted features.
(2) The spatial features of fMRI extracted in time were reduced by PCA, which effectively avoids the over-fitting phenomenon caused by small samples of high-dimensional data.
(3) The adaptive boosting decision tree (AdaDT) can turn the weak classifier set of the trained decision tree into a strong classifier and effectively avoid the under-learning phenomenon caused by insufficient learning data to classify ADHD.
The remainder of this article is arranged as follows: the second section introduces the theoretical background of the proposed CDAE-AdaDT algorithm in detail; the third section is the experimental setup, including data processing and training details of the CDAE-AdaDT algorithm model; the fourth section describes and discusses the experimental results; the last section summarizes the algorithm and experimental results of this article.

II. METHODS
In recent years, extracting features of unlabeled samples through autoencoder has achieved encouraging results with the rapid development of unsupervised learning [24], [25]. Therefore, 3D convolutional denoising autoencoder was used to extract the features of fMRI in this article. The following describes the feature extraction algorithm used in our method.

A. Feature Extraction
As a kind of artificial neural network, deep neural networks (DNNs) have attracted attention due to their improved performance. As a special structure of DNN, the CNN [26], [27] has the advantages of local connectivity and parameter sharing. It can extract spatial information from the original data without other complex preprocessing. The CNN structure used in this article includes three basic layers: the convolution layer, the pooling layer and the global average pooling layer. The common structure of a CNN is shown in Fig. 2.
Traditional CNNs apply 2D convolution kernels to 2D images. Because fMRI data is three-dimensional in space, 3D convolution kernels are used in this article to make better use of the spatial structure of the fMRI image. In the convolution layer, a series of 3D convolution kernels slide over the input image or the feature maps of the previous layer and are convolved with the receptive field at each position to learn the features of the data [28]. Let the output $v_{ij}^{xyz}$ of the neuron at position $(x, y, z)$ of the $j$-th feature map of the $i$-th layer be defined as

$$v_{ij}^{xyz} = f\left(b_{ij} + \sum_{n} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijn}^{pqr}\, v_{(i-1)n}^{(x+p)(y+q)(z+r)}\right)$$

where $n$ indexes the feature maps of the $(i-1)$-th layer, $b_{ij}$ denotes the bias, $P_i$, $Q_i$ and $R_i$ are the length, width, and height of the convolution kernel respectively, $w_{ijn}^{pqr}$ is the value at position $(p, q, r)$ of the convolution kernel connected to the $n$-th feature map, and $f$ is the nonlinear activation function.
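As an illustration of the 3D convolution described above, the following minimal sketch computes the response of a single kernel on a single input feature map in plain Python. The function names, the toy volume, the averaging kernel, and the choice of ReLU as the activation $f$ are assumptions made for this example; a practical implementation would rely on a deep learning framework rather than nested loops.

```python
def conv3d_at(volume, kernel, bias, x, y, z):
    """Weighted sum of one P x Q x R kernel placed at voxel (x, y, z), plus the bias."""
    total = bias
    for p, plane in enumerate(kernel):
        for q, row in enumerate(plane):
            for r, w in enumerate(row):
                total += w * volume[x + p][y + q][z + r]
    return total


def relu(v):
    """Nonlinear activation f (assumed to be ReLU in this sketch)."""
    return max(0.0, v)


def conv3d_valid(volume, kernel, bias=0.0):
    """Slide the kernel over every position where it fits entirely (no padding)."""
    P, Q, R = len(kernel), len(kernel[0]), len(kernel[0][0])
    X, Y, Z = len(volume), len(volume[0]), len(volume[0][0])
    return [[[relu(conv3d_at(volume, kernel, bias, x, y, z))
              for z in range(Z - R + 1)]
             for y in range(Y - Q + 1)]
            for x in range(X - P + 1)]


# Toy 3 x 3 x 3 volume whose voxel value is x + y + z, and a 2 x 2 x 2 averaging kernel.
volume = [[[float(x + y + z) for z in range(3)] for y in range(3)] for x in range(3)]
kernel = [[[0.125] * 2 for _ in range(2)] for _ in range(2)]
out = conv3d_valid(volume, kernel)
```

With several input feature maps, the inner sum would simply be repeated over the index $n$ and accumulated before the activation is applied.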
The convolution layer is followed by a pooling layer. Generally speaking, there are two kinds of pooling: max pooling and average pooling. The pooling operation down-samples the feature map to reduce the network parameters, which lessens the amount of computation, while its spatial invariance [29] helps preserve spatial relationships. In this article, we choose the max pooling operation, and the last layer of the network is a global average pooling (GAP) layer [30]. Unlike the traditional fully connected layer, GAP reduces each feature map to its average value, which not only reduces the number of network parameters and improves the training speed but also effectively prevents over-fitting.
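The down-sampling effect of max pooling can be sketched as follows. This is a minimal example that assumes non-overlapping windows (stride equal to the window size); the function name and the toy volume are illustrative only.

```python
def max_pool3d(volume, window):
    """Non-overlapping 3D max pooling; window = (wx, wy, wz), stride equals the window."""
    wx, wy, wz = window
    X, Y, Z = len(volume), len(volume[0]), len(volume[0][0])
    return [[[max(volume[x + i][y + j][z + k]
                  for i in range(wx) for j in range(wy) for k in range(wz))
              for z in range(0, Z - wz + 1, wz)]
             for y in range(0, Y - wy + 1, wy)]
            for x in range(0, X - wx + 1, wx)]


# Toy 4 x 4 x 4 volume; each voxel encodes its own coordinates as x*100 + y*10 + z.
volume = [[[x * 100 + y * 10 + z for z in range(4)] for y in range(4)] for x in range(4)]
pooled = max_pool3d(volume, (2, 2, 2))
```

Each 2 × 2 × 2 block is replaced by its largest value, so the 4 × 4 × 4 volume shrinks to 2 × 2 × 2 and subsequent layers see one eighth of the voxels.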
Despite the great success CNNs have achieved in various fields, especially in image classification [31], [32], it cannot be ignored that CNN-based classification algorithms need a large amount of manually labeled data since they are a type of supervised learning [33], [34]. Manual labeling is time-consuming in ADHD classification, which makes CNNs difficult to apply. Consequently, classification algorithms based on unsupervised learning have attracted attention in recent years because they require no labels.
The autoencoder (AE) is an unsupervised algorithm that can learn from data automatically. The purpose of the AE is to select encoder and decoder functions so that the image can be encoded with the least information and be reconstructed on the other side [35]. As an unsupervised learning method, the AE can reconstruct the input data at its output without labels while preserving the dimensions of the original data [36]. Fig. 3 is a schematic diagram of the autoencoder.
It can be seen from Fig. 3 that there are two parts in the autoencoder: the encoder and the decoder. In the structure of the autoencoder, each layer is fully connected with the next layer through an activation function. However, the autoencoder is always trained by comparing its output with the original input itself, which increases the similarity between input and output but leaves the network insensitive to other images of the same kind. This is especially obvious in fMRI images, where different frames differ only by nuances.
Vincent et al. [22], [23] proposed a denoising autoencoder aiming to solve the aforementioned problems. Random noise is added to the images to make each input vary slightly before applying them to the network. Finally, the output of the autoencoder is compared with the clean image before adding noise to optimize the network. In this way, the network will have better generalization ability when processing data, with little difference between different frames [37]. Fig. 4 is a schematic diagram of a denoising autoencoder.
The denoising autoencoder can not only extract low-dimensional representative features from the original data but also recover the clean image from the noisy data, which can effectively prevent the over-fitting problem while retaining robustness [38]. The excellent performance of CNNs directly motivated the CDAE. Strictly speaking, CDAE is a special case of the traditional denoising autoencoder, which uses convolution and pooling layers instead of fully connected layers. CDAE combines the merits of the CNN and the denoising autoencoder: it can not only acquire robust spatial characteristics of the input data through learning but also effectively prevent over-fitting.
Firstly, data with random noise is used as input to the neural network. The reconstructed data should be as close as possible to the original data rather than the noisy data; that is, the clean input is recovered from the corrupted data. Let $x = (x_1, \ldots, x_p)$ represent the raw data, where $x_i$ denotes a voxel of the fMRI frame and $p \in [0, 60 \times 72 \times 60]$. Let the data with random noise be $\tilde{x} = (\tilde{x}_1, \ldots, \tilde{x}_p)$, where $\tilde{x}_i$ is the voxel with added random noise and $p \in [0, 60 \times 72 \times 60]$. The noisy data is sent to the encoder network of CDAE to obtain the hidden layer data $h$:

$$h = g(W * \tilde{x} + b)$$

where $W$ is the weight matrix, $b$ denotes the bias vector, $*$ represents the convolution operation and $g$ represents the nonlinear activation function. The decoder can be regarded as the "mirror" of the encoder to some extent. The decoder restores data of the same size as the original by applying, after each deconvolution layer, an unpooling layer that uses nearest-neighbor interpolation. Accordingly, the decoder restores $y$ from the hidden layer $h$, that is

$$y = g(W^{T} * h + b^{T})$$

where $W^{T}$ and $b^{T}$ are the transposes of $W$ and $b$ respectively. Because the decoder weights are tied to the encoder weights in this way, the number of parameters is halved and the complexity of the network is effectively decreased [39]. The denoising autoencoder optimizes the network by minimizing the reconstruction error between $y$ and $x$. Compared with the autoencoder, the denoising autoencoder can not only reliably capture the main factors of variation from noisy data without assuming linearity but is also robust and can effectively prevent over-fitting [40]. In this article, CDAE is employed for feature extraction, and only the encoder part of the trained CDAE model is used to extract features from fMRI sequences.
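The key point of the denoising scheme, namely that the network sees the corrupted frame but is scored against the clean one, can be illustrated with a short sketch. The text only states that random noise is added, so the additive Gaussian noise, the noise level `sigma`, and the mean squared error used here are assumptions of this example.

```python
import random


def corrupt(frame, sigma, rng):
    """Return a noisy copy of a flattened frame (additive Gaussian noise, assumed)."""
    return [v + rng.gauss(0.0, sigma) for v in frame]


def mse(a, b):
    """Mean squared error between two equally long vectors (assumed loss)."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


rng = random.Random(0)
clean = [rng.random() for _ in range(1000)]   # stand-in for one flattened fMRI frame
noisy = corrupt(clean, sigma=0.1, rng=rng)

# The training pair is (noisy, clean): the input is corrupted, the target is clean.
identity_loss = mse(noisy, clean)   # loss of a network that merely copies its input
perfect_loss = mse(clean, clean)    # loss of a network that removes the noise entirely
```

A network that simply copies its input is left with a loss close to `sigma ** 2`, so it has to learn structure shared across frames in order to push the reconstruction error below that level.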
To improve the classification performance of the extracted features, we add a global average pooling layer after the encoder of the CDAE. For every frame $c \in \{1, 2, \ldots, n\}$ of the fMRI, this layer averages the $K$ activation values of each feature map, converting the obtained fMRI features into one-dimensional feature vectors. We then form the initial feature vectors by connecting the one-dimensional feature vectors end-to-end along the time dimension.
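The pooling and time-ordered assembly described above can be sketched as follows. The helper names and the toy feature maps are hypothetical; the sketch assumes that each frame yields a list of 3D feature maps from the encoder and that the frames are supplied in chronological order.

```python
def global_average_pool(feature_maps):
    """Average every 3D feature map down to a single number (one value per map)."""
    pooled = []
    for fmap in feature_maps:
        values = [v for plane in fmap for row in plane for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def subject_features(frames_in_time_order):
    """Stack the per-frame vectors in chronological order: n frames x maps per frame."""
    return [global_average_pool(maps) for maps in frames_in_time_order]


def constant_map(value):
    """Toy 2 x 2 x 2 feature map filled with one value."""
    return [[[value] * 2 for _ in range(2)] for _ in range(2)]


# Three frames, each with two feature maps produced by a (hypothetical) encoder.
frames = [
    [constant_map(1.0), constant_map(2.0)],
    [constant_map(3.0), constant_map(4.0)],
    [constant_map(5.0), constant_map(6.0)],
]
features = subject_features(frames)
```

The result has one row per time point, which is the layout that a subsequent dimension reduction step such as PCA would operate on.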
The classifier is prone to over-fitting due to the small amount of fMRI data and the correlation between the feature vectors extracted by CDAE. Therefore, PCA is used to decorrelate the initial feature vectors of fMRI to solve the problem.

B. Classifier
There are many kinds of classifiers, such as linear discriminant analysis, naive Bayes, k-nearest neighbor, support vector machine, random forest, and decision tree [41], [42], [43]. In this article, we adopted AdaDT for ADHD classification. The decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a classification result [44]. It is a very common supervised learning classification method. Supervised learning means that the classification results of the training samples are known and a decision tree is obtained by learning from these samples. In this way, the decision tree can then classify new data.
The base classifier used in this article is built with the classification and regression trees (CART) algorithm. CART is a binary tree in which the data is cut into two parts at each node by a binary split and sent into the left and right subtrees [45] respectively. Each non-leaf node has two children, so there are more leaf nodes in CART than non-leaf nodes. In CART classification, the Gini index, namely the Gini impurity, is used to select the optimal splitting feature; its meaning is similar to that of information entropy. Each split in CART reduces the Gini impurity. The smaller the Gini impurity is, the higher the purity is, and the better the classification is. The Gini impurity ($G$) is defined as

$$G(S) = 1 - \sum_{i=1}^{k} p_i^{2}$$

where $S$ represents all samples, $p_i$ represents the probability of the $i$-th category, and $k$ represents the total number of categories. The decision tree is powerful but unstable: the tree can change greatly when the training data varies [46]. Compared with a single decision tree, an ensemble of trees has a higher prediction ability and can handle problems that are difficult for a single decision tree. Ensemble algorithms train multiple learners to solve the same problem, and the commonly used combination methods are bagging and boosting [47]. Boosting is used in this article. Adaptive boosting (AdaBoost) is one of the most popular boosting algorithms and is a supervised learning method. It combines weak classifiers according to certain rules to build a strong classifier [48], [49]. AdaBoost determines the weight of each sample according to whether it was classified correctly in the current round and the overall accuracy of the previous round, and then passes the re-weighted data to the next classifier for training. Finally, the classifiers obtained in all rounds are fused, and the fused classifier is the final decision classifier used for the target classification.
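As a small worked example of the splitting criterion, the Gini impurity of a node is one minus the sum of the squared class proportions, and a candidate split is scored by the size-weighted impurity of its two children. The sketch below uses hypothetical function names and toy labels.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0.0 for a pure (or empty) node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two children produced by a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


mixed_node = ["ADHD", "TD", "ADHD", "TD"]
pure_split = split_impurity(["ADHD", "ADHD"], ["TD", "TD"])
useless_split = split_impurity(["ADHD", "TD"], ["ADHD", "TD"])
```

A balanced two-class node has the maximum impurity of 0.5; a split that separates the classes perfectly drives the weighted impurity to 0, while a split that leaves both children balanced achieves no reduction, so CART would prefer the former.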
Compared with other machine learning algorithms, the generalization ability of the AdaDT classifier does not degrade as the number of iterations increases, and over-fitting can be avoided at the same time, which makes AdaDT more suitable for medical images with fewer samples. Specifically, the implementation steps of AdaDT are shown in Algorithm 1.
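To make the re-weighting idea concrete, the following sketch implements the standard discrete AdaBoost procedure with depth-one trees (decision stumps) for labels in {+1, -1}. It is a generic textbook variant written for illustration, not necessarily the exact configuration used in this article; the stump search, the function names, and the toy data are assumptions.

```python
import math


def train_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                pred = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity


def adaboost_fit(X, y, rounds):
    n = len(X)
    w = [1.0 / n] * n                        # uniform initial sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)         # weak learner on the current weights
        err = max(stump[0], 1e-12)           # guard against division by zero
        if err >= 0.5:                       # no better than chance: stop
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose weight.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D problem that no single threshold can solve: + + - - + +
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
single = adaboost_fit(X, y, rounds=1)
model = adaboost_fit(X, y, rounds=3)
```

On this toy problem a single stump misclassifies two of the six samples, whereas the weighted vote of three stumps classifies all of them correctly, which is the sense in which boosting turns weak classifiers into a strong one.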

Algorithm 1: AdaDT
Step:
1. Initialize the weight distribution of the training data, $W_1(i) = 1/N$, where $i = 1, 2, \ldots, N$, $N$ denotes the total number of samples, and $t = 1, \ldots, T$ indexes the iterations.
2. Train the weak classifier $h_t = \xi(W_t)$ on the sample distribution $W_t$ in each iteration.
3. For the weak classifier corresponding to the $j$-th feature, calculate the error $\varepsilon_i$.
Output: Final classification results

A. Data and Preprocessing
The data we used is from the ADHD-200 public dataset. The dataset was collected at eight international imaging sites and includes the rs-fMRI, sMRI and basic phenotypic information (age, gender, dominant hand and intelligence quotient (IQ)) of 973 individuals, of whom 362 are children and adolescents diagnosed as ADHD, 585 are TD and 26 are unknown [9]. We only used five sites: Peking, KKI, NeuroImage, NYU and OHSU. The other three sites were not used in this experiment because Brown University (Brown) lacks the diagnostic information of each subject, and the University of Pittsburgh (Pittsburgh) and Washington University (WashU) only have TD subjects in the training set and lack ADHD subjects. Since the classification is related to the proportion of data, we excluded these three sites and only used the data of the remaining five sites for testing.
In this article, the Data Processing Assistant for Resting State fMRI (DPARSF) toolbox [50], [51] was used to process the raw fMRI data, with the processing flow as follows: (1) The first four time points of the training data and the first three time points of the test data were removed to eliminate the influence of signal instability and to balance the data; (2) Slice-timing correction; (3) Head-motion correction; (4) Normalization into the Montreal Neurological Institute (MNI) space and resampling to 3-mm isotropic voxels; (5) Band-pass filtering; (6) Linear detrending and removal of the nuisance covariates, including WM, CSF, global signal and six head motion parameters; (7) Smoothing with a Gaussian filter (full width at half maximum, FWHM = 4 mm). Before the experiment, samples without the corresponding rs-fMRI data were identified and deleted.
To prevent the noise in the scanning process from interfering with the fMRI data, the data of subjects whose head translation exceeded 3 mm or whose rotation exceeded 3 degrees were removed after preprocessing. At the same time, the data of subjects with artifacts or poor registration were removed through visual inspection. The final data composition used in this article is shown in Table I. The number of fMRI frames participating in CDAE model training was 93650.

B. Model Training
The deep learning model is implemented in the Keras framework with TensorFlow as the back-end. The Adam optimizer is used with a learning rate of 0.0001 and a batch size of 50. CDAE consists of two parts: a convolution encoder and a deconvolution decoder. The encoder extracts the feature map of each fMRI frame, while the decoder reconstructs the image from the feature map. Fig. 5 shows the CDAE-AdaDT model proposed in this article. Fig. 5 (a) shows the training process of the CDAE model with 15 layers. The first layer and the last layer are the input and output layers respectively. The second to the seventh layers belong to the encoder and the eighth to the fourteenth layers belong to the decoder. Each layer is connected to the next layer by a linear operation followed by an activation function. The network is optimized by minimizing the loss function. Fig. 5 (b) shows the ADHD classification flow chart based on the CDAE-AdaDT model.
The detailed training steps of the CDAE-AdaDT model are as follows. First, every single fMRI image with added random noise was used as an input of size 60 × 72 × 60. The encoder consists of three convolution layers, each followed by a max-pooling layer. The kernels in the convolution layers are 3 × 3 × 3, 2 × 2 × 2 and 3 × 3 × 3 respectively, while the window sizes of the max-pooling layers are 2 × 2 × 2, 3 × 3 × 3 and 2 × 2 × 2 respectively. The max-pooling layers reduce the size of the feature map and the number of network parameters. The kernel sizes and pooling window sizes in the decoder are symmetric to those of the encoder. At the end of the encoder network, a dense layer was added to realize the nonlinear combination of features and increase the representativeness of the extracted features. The model learned the abstract features of the image during training. As shown in Fig. 5 (a), the output of the previous layer was the input of the next layer. The activation function is the rectified linear unit (ReLU), that is

$$f(z) = \max(0, z)$$

where $z$ is the input to the activation function.
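The effect of the three pooling stages on the frame size can be checked with simple arithmetic. The sketch below assumes that the convolutions preserve the spatial size (i.e., use "same" padding) and that each pooling layer uses a stride equal to its window; neither detail is stated explicitly in the text, so the resulting shapes are indicative only.

```python
def encoder_shapes(input_shape, pool_windows):
    """Spatial size after each pooling stage, assuming size-preserving convolutions."""
    shapes = [input_shape]
    current = input_shape
    for window in pool_windows:
        current = tuple(dim // w for dim, w in zip(current, window))
        shapes.append(current)
    return shapes


shapes = encoder_shapes((60, 72, 60), [(2, 2, 2), (3, 3, 3), (2, 2, 2)])
voxels_in = 60 * 72 * 60
voxels_out = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Under these assumptions each feature map shrinks from 60 × 72 × 60 to 5 × 6 × 5, a reduction from 259200 to 150 spatial positions per map.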
In this article, the adaptive moment estimation (Adam) algorithm was adopted to optimize the network by minimizing the reconstruction error between the original clean image $x_i$ and the reconstructed image $y_i$. Finally, the pre-trained encoder was used as the initial feature extractor of fMRI data. It is worth noting that the fMRI frames are shuffled when training the CDAE, whereas in the feature extraction stage the fMRI data is passed through the encoder in chronological order to retain the temporal characteristics. The output of the encoder was connected to a global average pooling layer, which transforms the features of the extracted fMRI sequences into one-dimensional vectors. The global average pooling layer has fewer parameters than the traditional fully connected layer. Because fMRI data is characterized by "small sample with high-dimension", PCA was adopted to reduce the dimensionality of the extracted time series and thereby reduce over-fitting. PCA is a dimensionality reduction method widely used in data analysis. The initial feature vector of fMRI after PCA has size $n \times p_c$, where $n$ is the number of time points of each fMRI scan and $p_c$ is the number of features left after dimension reduction. In this article, the selected value of $p_c$ is 48, and $n$ varies across sites.
The data of different sites after dimension reduction are sent to the classifier for training. The scanning parameters are different among sites and testing at all sites may aggravate the impact of data heterogeneity, hence we chose to train classifiers and test them on individual sites. It is very important to choose two parameters when training AdaDT classifier: the number of weak classifiers and the number of nodes in each tree. In this article, the optimal parameters were selected for each site classifier to achieve the best classification results through a large number of experiments.

A. Visualization
The traditional fully connected network is often regarded as a "black box" algorithm. To better understand the features learned by CDAE, the weights of the convolution layers and the feature maps are visualized. Weights play an important role in neural networks. Xavier initialization is used here; that is, the weights are drawn from a uniform distribution whose range keeps the variance of the information flowing through the network unchanged. Fig. 6 visualizes 16 weights of the first convolution layer. It shows that, after training, the weights of the first convolution layer have diverged from their initial uniform distribution, each in a different way. Different kernel weights mean that different convolution kernels extract features from different angles, so they can effectively learn and process the fMRI image. Fig. 7 shows the output of the first filter of the first convolution layer. Fig. 8 shows the difference between the feature maps of the first filter in the third convolution layer for a randomly selected ADHD subject and a randomly selected TD subject. As can be seen from Fig. 7 and Fig. 8, the features learned by the model become more abstract as the convolution layers get deeper.
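For reference, Xavier (Glorot) uniform initialization draws each weight from a uniform distribution on [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). The sketch below evaluates this bound for a 3 × 3 × 3 kernel; the single input channel and the 16 output filters are assumptions made for the example, as are the function names.

```python
import math
import random


def xavier_uniform_limit(kernel_size, in_channels, out_channels):
    """Bound of the Xavier (Glorot) uniform distribution for a convolution kernel."""
    receptive = kernel_size[0] * kernel_size[1] * kernel_size[2]
    fan_in = receptive * in_channels
    fan_out = receptive * out_channels
    return math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform_kernel(kernel_size, in_channels, out_channels, rng):
    """Flat list of weights drawn uniformly from [-limit, limit]."""
    limit = xavier_uniform_limit(kernel_size, in_channels, out_channels)
    count = kernel_size[0] * kernel_size[1] * kernel_size[2] * in_channels * out_channels
    return [rng.uniform(-limit, limit) for _ in range(count)]


# Assumed configuration: 3 x 3 x 3 kernels, 1 input channel, 16 filters.
limit = xavier_uniform_limit((3, 3, 3), 1, 16)
weights = xavier_uniform_kernel((3, 3, 3), 1, 16, random.Random(0))
```

Because the variance of a uniform distribution on [-a, a] is a²/3, this choice gives every weight a variance of 2 / (fan_in + fan_out), which is what keeps the signal variance roughly constant from layer to layer.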

B. Comparison of Different Parameter Values
In order to choose the best $p_c$ value and classifier, we compare combinations of different $p_c$ values and classifiers. We adopt grid search to select the optimal parameters of the classifiers. Besides AdaDT, the compared classifiers include the linear support vector machine (L-SVM), the radial basis function kernel support vector machine (RBF-SVM) and random forest (RF). The candidate $p_c$ values are 12, 24, 48 and 96. Accuracy, sensitivity and specificity are used as evaluation indices. Accuracy represents the ability of the model to distinguish ADHD and TD correctly. It is given by the following formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (9)$$

Sensitivity represents the ability to identify ADHD correctly, which is estimated by the following formula:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (10)$$

Specificity describes the ability of the model to identify TD correctly, which can be obtained from the following formula:

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (11)$$
where $TP$ is the number of true positives (ADHD subjects correctly classified as ADHD), $FP$ is the number of false positives (TD subjects wrongly classified as ADHD), $TN$ is the number of true negatives (TD subjects correctly classified as TD), and $FN$ is the number of false negatives (ADHD subjects wrongly classified as TD). Fig. 9 shows the boxplot of the three evaluation indices obtained by different classifiers on all test sets. It indicates that the ensemble classifiers (RF and AdaDT) yield better results than the SVM classifiers. Moreover, AdaDT keeps a balance among all three indices. Since accuracy, sensitivity and specificity are of equal importance, we use all three indices on all test sets as the data of the boxplot to select the best value of $p_c$. Fig. 10 shows the results for different combinations of classifiers and $p_c$. Upon inspecting Fig. 10, we see that the indices increase with $p_c$ while it is less than 48 and reach the best performance when $p_c$ is 48, whereas the performance decreases when $p_c$ is greater than 48. AdaDT yields the best results when $p_c$ is 48. On the basis of these results, we adopt AdaDT as the classifier and 48 as the value of $p_c$ to perform classification.
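The three evaluation indices follow directly from the four confusion counts, with ADHD treated as the positive class. The sketch below uses hypothetical function names and a toy set of ten predictions.

```python
def confusion_counts(y_true, y_pred, positive="ADHD"):
    """Return (TP, FP, TN, FN), treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred):
    """Accuracy, sensitivity and specificity from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of ADHD subjects found
        "specificity": tn / (tn + fp),   # share of TD subjects found
    }


# Toy test set: 4 ADHD and 6 TD subjects; one of each is misclassified.
y_true = ["ADHD"] * 4 + ["TD"] * 6
y_pred = ["ADHD", "ADHD", "ADHD", "TD", "TD", "TD", "TD", "TD", "TD", "ADHD"]
scores = evaluate(y_true, y_pred)
```

In this toy case the accuracy is 0.8, the sensitivity 0.75 and the specificity 5/6. Note that the ratios are undefined for a test set that lacks one of the two classes, which this sketch does not guard against.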

C. Comparison of Classification Results Among Different Sites
In this article, the test dataset of ADHD-200 is used to evaluate the performance of the model. There are two different ways to compare results in the literature using ADHD-200 data: comparing classification results at individual sites and comparing classification results over all sites combined. The vast majority of studies choose only one of these to demonstrate the effectiveness of their experiments, and the evaluation indicators vary with the comparison method. In order to make the evaluation of the proposed method more persuasive, we employ both methods to explain the experimental results.

1) Comparison of Classification Results Among Different Sites: The following ADHD classification algorithms were selected for comparison: (1) the 2017 ADHD-200 global competition champion algorithm (ADHD-2017) provided in [9]; (2) the deep belief network based ADHD classification algorithm (DBN) proposed in [14]; (3) the 3D-CNN based ADHD classification algorithm (3D-CNN) proposed in [12]; (4) an R-RELIEF based ADHD classification algorithm (R-RELIEF) proposed in [52]. Table II shows the results of the proposed method and the comparison algorithms on the test set, where "-" indicates that the corresponding site was not used by that method, so no result is available. Table II shows that our method achieves the highest accuracy at each site, ranging from 70.59% to 83.33%. Compared with ADHD-2017, the accuracy at individual sites increases by 16.98-40.42%. The accuracy of DBN at NYU is only 37.41%, which our algorithm improves by 37.98%. At KKI, the accuracy of our method equals that of R-RELIEF; our method additionally includes the OHSU site, and at the remaining sites it outperforms R-RELIEF.
To demonstrate the performance of the model comprehensively, the ROC curve and the area under the curve (AUC) are employed. The ROC curve takes the false positive rate (FPR, i.e. 1 - specificity) as the abscissa and the true positive rate (TPR, i.e. sensitivity) as the ordinate, and reflects the trade-off between sensitivity and specificity as the decision threshold varies. Compared with the P-R curve (precision and recall), the ROC curve has the major advantage that its shape remains essentially unchanged when the distribution of positive and negative samples changes, whereas the shape of the P-R curve generally changes dramatically. This evaluation method reduces the interference caused by different test sets and measures the performance of the model itself more objectively. The AUC is the area under the ROC curve; the larger the AUC, the better the classification performance of the model. Fig. 11 shows the ROC curves and AUC values of the algorithm at different sites.
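How an ROC curve and its AUC are obtained from classifier scores can be sketched as follows. This is a minimal illustration in plain Python, assuming labels coded as 1 = ADHD and 0 = TD and a score for which larger values mean "more likely ADHD"; the function names `roc_curve_points` and `auc_trapezoid` and the toy data are hypothetical and are not the evaluation code used to produce Fig. 11.

```python
def roc_curve_points(y_true, scores, positive=1):
    """Return ROC points (FPR, TPR) obtained by lowering the threshold.

    Samples with tied scores are consumed together so that each distinct
    threshold contributes exactly one point.
    """
    n_pos = sum(1 for y in y_true if y == positive)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Sort samples from the highest score to the lowest.
    pairs = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)

    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Toy scores, purely illustrative.
    y_true = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print(auc_trapezoid(roc_curve_points(y_true, scores)))
```

In this toy example, duplicating every TD sample (which changes the class ratio but not the scores) leaves the ROC points and the AUC unchanged, because FPR and TPR are each normalized within their own class.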

2) Comparison of Classification Results in Comprehensive Sites: To make the experiment more complete, we also test our model on the datasets of all sites combined. The comparison methods are as follows: (1) the classification algorithm (denoted as MKL) that uses multi-kernel learning to fuse multimodal MRI features, proposed in [10]; (2) the classification algorithm (denoted as MDS-SVM) that applies a support vector machine after multi-dimensional scaling of the functional connection network, proposed in [53]; (3) the ADHD classification algorithm based on 3D-CNN (denoted as 3D-CNN), proposed in [12]; (4) the algorithm based on 4D-CNN (denoted as 4D-CNN), proposed in [15]. Table III shows that the proposed method yields the best results among the compared methods.

V. CONCLUSION
In this article, a new fMRI-based ADHD classification method is proposed, which extracts features directly from fMRI images to classify ADHD and TD. The experimental results at different sites show that the proposed method is superior to existing methods in accuracy and maintains a certain balance between specificity and sensitivity. Visualizing the feature maps of the middle layers shows that CDAE can effectively extract local information from the spatial dimensions, which is helpful for classification. Although pretraining the CDAE increases the computational and storage cost of training, it effectively improves ADHD classification performance. In future work, we will focus on reducing the impact of data heterogeneity on the classification results as much as possible. Because the current method does not exploit the fMRI data as a time series (and thus implicitly ignores time as an independent dimension), we will also explore better models and methods for extracting features along the time dimension.