Outlier-based Autism Detection using Longitudinal Structural MRI

Diagnosis of Autism Spectrum Disorder (ASD) using clinical evaluation (cognitive tests) is challenging due to wide variations amongst individuals. Since no effective treatment exists, prompt and reliable ASD diagnosis can enable the effective preparation of treatment regimens. This paper proposes structural Magnetic Resonance Imaging (sMRI)-based ASD diagnosis via an outlier detection approach. To learn spatio-temporal patterns in structural brain connectivity, a Generative Adversarial Network (GAN) is trained exclusively with sMRI scans of healthy subjects. Given a stack of three adjacent slices as input, the GAN generator reconstructs the next three adjacent slices; the GAN discriminator then identifies ASD sMRI scan reconstructions as outliers. This model is compared against two other models: a simpler UNet and a more sophisticated Self-Attention GAN. Axial, Coronal, and Sagittal sMRI slices from the multi-site ABIDE II dataset are used for evaluation. Extensive experiments reveal that our ASD detection framework performs comparably with the state-of-the-art while using far less training data. Furthermore, longitudinal data (two scans per subject over time) achieve 17-28% higher accuracy than cross-sectional data (one scan per subject). Among other findings, the metrics employed for model training and for reconstruction loss computation both impact detection performance, and the Coronal modality is found to best encode structural information for ASD detection.

The National Database for Autism Research (NDAR) [21] and Autism Brain Imaging Data Exchange (ABIDE) [22,23] are popular open-access databases for ASD research. Neuroimaging scans are obtained either as cross-sectional (one sample per person) or longitudinal (multiple samples per person captured over time) samples. Longitudinal scans enable us to examine and monitor changes in brain structure and function in individuals over time. Most neuroimaging-based diagnostic research is based on cross-sectional data. Recently, a few researchers have analyzed longitudinal data to predict neurological disorders via machine learning [24]. Among the aforementioned databases, ABIDE II [23] provides longitudinal samples. Longitudinal data collection is difficult, as data need to be acquired for the same subject at various time points. Subjects' willingness to engage in multiple scanning sessions is not assured in longitudinal setups, and the number of samples per subject typically drops over time. A major drawback of longitudinal studies is therefore a limited sample size and fewer participants [25].
Conventional machine learning frameworks use various handcrafted features and classification techniques such as the Support Vector Machine (SVM). However, the handcrafted features are the bottleneck for the success of these frameworks, as in-depth domain knowledge and experience are usually required to design them. In contrast, deep learning frameworks accomplish the same by learning the intricate bio-markers from a substantial amount of training data. The learnt features, usually the outputs of the initial layers of the deep neural nets, can sometimes be related to the handcrafted features identified by medical experts. Thus, deep learning frameworks complement, rather than replace, the physician's regular diagnosis of medical disorders. ASD has been diagnosed via deep learning [26], [27]. Among deep learning architectures, autoencoders [28] enable low-dimensional embedding of a high-dimensional input via an encoder-decoder block. We employ a Generative Adversarial Network (GAN)-based encoder-decoder framework for sMRI-based ASD detection. To learn structural brain connectivity, an encoder maps an sMRI image slice onto a low-dimensional vector; the decoder then reconstructs the next slice from this embedding. The actual and reconstructed next slices are compared to compute the reconstruction loss, which is then back-propagated to train the GAN. When the GAN is exclusively trained with healthy sMRI scans, higher reconstruction losses would result for ASD scans due to structural connectivity differences between healthy and ASD subjects. The GAN discriminator would therefore view ASD scans as outliers (with reconstruction loss greater than a threshold), enabling unsupervised ASD detection.
Single slice reconstruction error [29,30] has typically been employed as the objective for model training. In contrast, we conjecture that structural connectives between adjacent sMRI slices capture class-specific characteristics better than single slices, and train the GAN model with stacks of three contiguous slices. We also evaluate three encoder-decoder architectures, namely GAN [31], UNet [32] and Self-Attention GAN (SAGAN) [33], for detection efficacy. Further, most works on longitudinal ASD data analysis employ supervised learning, where the availability of sufficient ASD data is critical, but ASD data are scarce. Modeling ASD scans as outliers, as in our approach, addresses this issue. In summary, this paper makes the following research contributions.
1. We employ a GAN encoder-decoder framework for sMRI-based ASD detection. The GAN trained exclusively from healthy scans views ASD samples as outliers and enables ASD diagnosis. This approach obviates the need for many ASD training samples.
2. To effectively model structural brain connectives, stacks of three adjacent sMRI slices are input to the GAN to reconstruct the next three slices. Slice reconstruction loss is employed as the training objective. Empirical results (Table 7) confirm that modeling structural patterns from three-slice stacks is more beneficial vis-à-vis single slices.
3. The L2+cosine loss objective is found to be the most effective, and a combination of the Axial and Coronal slices achieves the best detection performance.
The paper is structured as follows. A survey examining related work is presented in Section 2. Section 3 details our framework and other baselines. Section 4 discusses empirical results, and the paper concludes in Section 5.

Related work
This section reviews longitudinal sMRI-based ASD diagnosis, and deep learning models developed to this end.

Longitudinal Studies on ASD detection
ASD detection has been attempted with both cross-sectional and longitudinal sMRI data. Wang et al. [34] conducted a study of cerebellar thickness to determine longitudinal differences associated with ASD. The analysis used longitudinal scans from the ABIDE II dataset, which includes 19 ASD subjects and 14 healthy subjects. Correlation between ADOS scores and lobular thickness data was examined. Subjects with ASD showed smaller lobular thickness and asymmetry in the right cerebellum, and this reduction is associated with the severity of behavioral symptoms.
Fu et al. [35] studied Gray Matter and White Matter association with respect to ASD using longitudinal sMRI and Diffusion Tensor Imaging for 34 ASD and 26 healthy subjects. Chi-square and t-tests were used to compare demographic and clinical features extracted from the baseline and follow-up scans for ASD and healthy subjects. The study discovered that, across time, Fractional Anisotropy (FA) and brain volume of white matter were higher in ASD subjects.
Ning et al. [36] analyzed the developmental patterns of core-symptom-anchored cortical vertex-wise Gyrification Index (GI) in ASD. They used data from 321 ASD and 350 healthy subjects from ABIDE I, and 14 ASD plus 7 healthy subjects' longitudinal data from the ABIDE II dataset. Statistical differences between two groups were examined using chi-square and t-tests. While comparing GI between the baseline and follow-up conditions, significant variations were discovered in ten ASD clusters, with nine clusters showing decreased gyrification. Prigge et al. [37] reported longitudinal volumetric findings in ASD subjects acquired from FreeSurfer. Linear mixed-effects models were used to characterize longitudinal volumetric changes in the brain over time. ASD-specific findings included larger gray matter in early childhood, enlarged ventricles by early adulthood, and reduced corpus callosum volume in adulthood. Devika et al. [38] investigated longitudinal sMRI samples from ABIDE II for supervised ASD detection, and reported a classification accuracy of 94.29% using Support Vector Machines.

Deep learning Models
Mostafa et al. [39] used a Convolutional Autoencoder (CAE) for single sMRI slice reconstruction to diagnose ASD. The study used T1-weighted sMRI scans from 403 ASD and 468 healthy subjects from ABIDE I [22]. The CAE was trained with healthy subjects and tested with both ASD and healthy subjects. Similarity indices including the Structural Similarity Index (SSIM), Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR) were used as features for SVM and Linear Discriminant Analysis (LDA) classification. Baur et al. [40] developed an unsupervised UNet model on MRI of healthy subjects to detect variations corresponding to anomalous MRI. The model was tested on five different MRI datasets, and achieved a highest F1-score of 62%.

Generative Models for Clinical Diagnosis
In medical imaging, generative models have become popular for clinical diagnosis due to their ability to learn complex distributions from input samples [41,42]. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) are popular generative models, but GANs are advantageous as they do not explicitly compute probability densities and yield better results than VAEs via a game-theoretic approach. AnoGAN [43] was the first work to employ GAN to detect retinal anomalies using spectral-domain Optical Coherence Tomography (OCT). A Deep Convolutional GAN (DCGAN) was trained using 2D image patches compiled from clinical OCT volumes of healthy subjects. This model was tested with both healthy and pathological samples, and the weighted sum of the residual and discrimination losses was used to compute the anomaly score. Inspired by AnoGAN, unsupervised metastatic bone tumor classification with GAN was proposed in [44]. Anomaly scores were determined by comparing the test image with a synthesized one at both the image and feature levels. Although AnoGAN demonstrated high performance, iterative techniques suffer from computing inefficiency for real-world applications, which was addressed via fast AnoGAN (f-AnoGAN) [45] that learnt a mapping from image to latent space with a Wasserstein GAN.
Han et al. [46] developed a two-step procedure for abnormality diagnosis from T1-weighted (T1w) sMRI Axial slices. Training was done on healthy subjects, while testing included both healthy and Alzheimer's disease (AD) samples. UNet and GAN architectures were investigated, and a maximum AUC score of 0.92 was reported with the GAN architecture. Han et al. [47] studied the effect of self-attention (SA) modules in GAN architectures to detect AD and brain metastases. The study used longitudinal samples, plus a cross-sectional dataset compiled by the authors, and reported highest AUC scores of 0.89 and 0.92 for AD and brain metastases detection respectively. Translation from Abnormal-to-Normal GAN (ANT-GAN) is a variant of CycleGAN [48], which was developed to synthesize a normal-looking medical image from an abnormal one, and vice versa. The method was tested on MRI and CT scans from publicly available datasets. ANT-GAN was able to synthesize extremely realistic healthy scans that nearly matched images with lesions. Once a healthy scan is generated from an abnormal counterpart, discrepancies between the input and synthesized images can be utilized to segment abnormal regions and to contrast healthy and abnormal scans. In ASD, GAN-generated synthetic data has significantly improved classification performance [49].

Methodology
An overview of the proposed framework is illustrated in Figure 1. There are five stages. Firstly, we pre-process sMRI longitudinal scans via the Freesurfer longitudinal pipeline. Secondly, we extract 2D slices from the longitudinally pre-processed 3D sMRI scans. Thirdly, Axial slices are selected for further processing. Fourthly, a GAN-based encoder-decoder framework is trained using only healthy subjects' data. The GAN objective is to reconstruct the next three adjacent slices from an input stack of three (current) adjacent slices. Finally, classification performance is evaluated based on the average loss between the reconstructed and ground-truth sMRI slices. Details of each stage and background are presented below.

Dataset Description
Autism Brain Imaging Data Exchange (ABIDE) is an open-source data collection with two subsets: ABIDE I [22] was launched in 2014, and ABIDE II [23] in 2017. ABIDE II includes data from 19 different sites, with longitudinal data from two sites (UCLA and UPSM) [23,50]. The scanner, scanning specifications, and scanning method all differ across scans, since each sMRI scan is gathered independently. In accordance with prior studies [39,51,52], heterogeneity of the dataset is not explicitly addressed. Longitudinal data from these two sites includes data for 23 ASD and 15 healthy subjects. The subjects' ages range from 9-17 at the baseline scan (median age of 12.6), and 10-19 at the follow-up scan (median age of 15) across the healthy and ASD groups. The two longitudinal sets include T1-weighted (T1w) sMRI and rsfMRI scans in NIFTI format, and phenotype information in comma-separated value (.csv) format, collected during a one-to-two year period. By measuring the amount of water in the tissues, the sMRI data reveals the various types of tissues present. In T1w images, water and fluid-containing tissues look dark, while fat-containing tissues appear bright. This paper uses only longitudinal T1w sMRI slices for investigation.

Data pre-processing
Pre-processing is a necessary step for reducing inter-subject data variability resulting from data collection. Despite data being collected at different sites, prior sMRI studies on ABIDE I and II only employed general pre-processing steps [19,51]. We followed suit and ignored inter-site scan capture variations. We employed the open-source Freesurfer (v.6.0) longitudinal pipeline to pre-process the T1w sMRI longitudinal samples [53,54]. Longitudinal processing requires cross-sectional processing followed by generation of a within-subject template (base image) via sequential, inverse-consistent registration of each time point scan to an average image. Following this, each time point scan is processed independently. The longitudinal pipeline is found to have higher cross-session dependencies than the cross-sectional pipeline [55]. We intend to leverage these dependencies with longitudinal samples. During pre-processing, three ASD subjects had surface reconstruction errors due to poor image quality, and hence their data were discarded.

Multiple sMRI slice reconstruction
Longitudinally  [56,57]. We thus obtained slices along the Axial, Coronal and Sagittal planes. Among these, the Axial plane is popularly used [58,59], but Convolutional Neural Networks (CNNs) have also effectively learned from Coronal plane [60]. In this study, we examined the utility of all three sMRI imaging planes. Of the 256 slices in a typical sMRI scan, slice sequence 120-180 is known to contain a majority of the vital brain information [39]. Hence, we extracted this 60-slice sMRI sequence for future stages. Extracted image dimensions (height × width) for the three planes are 256 × 176 (Axial), 256 × 256 (Coronal) and 256 × 256 (Sagittal). The Axial dimensions are specified by default in the later sections. We sought to model structural connectives between adjacent sMRI slices by feeding the GAN with, and reconstructing contiguous slices. To determine the optimum slice combinations, we considered the UNet model and conducted a preliminary experiment with four different input-output slice combinations, namely, 3-3, 3-5, 5-3 and 5-5. For instance, the UNet33 was trained with a 3-3 slice combination where the adjacent 3 slices (e.g., 1,2,3) of dimension 256×176×3 were input to reconstruct the next adjacent 3 slices (e.g., 4,5,6) as in [46,47]. Sample input and predicted output slices for the 3-3 and 3-5 combinations are shown in Table 2. The same convention is used for the UNet35, UNet53, UNet55, GAN33 and SAGAN33 models. The L2 loss function was employed for training. The performance and computational time for each model on a computational cluster with a 28-core NVidia V100 GPU with 1.125TB RAM is shown in Table 3. To maintain reasonable model training time and performance, we chose the 3-3 slice combination for our experiments.
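The 3-3 pairing described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `make_stack_pairs`, the choice of axis 0 as the slice axis, and the stride of one slice between consecutive windows are all assumptions, since the text does not state whether windows overlap. A toy volume with small in-plane dimensions is used to keep the example light.

```python
import numpy as np

def make_stack_pairs(volume, start=120, stop=180, n_in=3, n_out=3):
    """Pair each run of n_in adjacent slices with the n_out slices that follow.

    volume: array of shape (num_slices, height, width); axis 0 is assumed
    to index slices along the chosen plane.
    """
    slices = volume[start:stop]                  # the 60-slice range 120-180
    inputs, targets = [], []
    # Assumed stride of 1: consecutive windows overlap.
    for i in range(len(slices) - (n_in + n_out) + 1):
        x = slices[i:i + n_in]
        y = slices[i + n_in:i + n_in + n_out]
        inputs.append(np.moveaxis(x, 0, -1))     # (height, width, n_in)
        targets.append(np.moveaxis(y, 0, -1))    # (height, width, n_out)
    return np.stack(inputs), np.stack(targets)

# Toy volume: 180 slices of 16 x 11 pixels (real Axial slices are 256 x 176).
volume = np.random.default_rng(0).random((180, 16, 11)).astype(np.float32)
X, Y = make_stack_pairs(volume)
print(X.shape, Y.shape)
```

Under these assumptions, the first input stack holds slices 120-122 and its target holds slices 123-125; passing `n_out=5` would give the 3-5 combination.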

Architecture details
For slice reconstruction, we explored three networks, a GAN, a computationally less-intensive UNet and a more computationally intensive self-attention GAN (SAGAN). Their architectures are described below.

UNet33 architecture
The UNet33 architecture is illustrated in Figure 2. There are two paths, namely, the contracting path (encoder), and the expansion path (decoder) [61]. The output dimensions (height × width × channels) of each layer are specified within the boxes. Layers C1-C5, each denoting a series of two convolutional layers with batch normalization (BN) and rectified linear unit (ReLU) activation, and max-pooling layers P1-P4 form the encoder, where the input image size decreases and the depth increases from 256 × 176 × 3 to 16 × 11 × 256. Transposed convolutions (DC1, DC2, DC3, and DC4) are applied in the decoder, where the image size increases and the depth decreases from 16 × 11 × 256 to 256 × 176 × 3. Skip connections are added at all decoder stages (S1-S4) by concatenating the transposed convolution layer outputs with the corresponding encoder features. Every skip connection is followed by two regular convolutions (C6-C9), and a dropout of 0.5 (denoted by 'D') is performed at two points following skip connections to prevent overfitting [62,63]. The output layer involves a 1 × 1 convolution with Sigmoid activation. The Adam optimizer with a learning rate of 2.0 × 10^-4 and the L2 loss function are employed for model training.
The L2 loss used to measure reconstruction quality is defined in Eqn. (1), where X_{i,j} and Y_{i,j} denote the i-th ground-truth/reconstructed slice stack of size n, i ∈ 1 . . . m. The input and reconstructed slice stacks are more similar as the L2 loss decreases. We used early stopping with a patience threshold of ten epochs, and a batch size of 8. Hyperparameters were fine-tuned via grid search, and an input/reconstructed stack length of m = 3 was fixed as it achieved the best accuracy as seen from Table 3.

GAN33 architecture
Figure 3 presents the architecture for GAN33. The GAN comprises two neural networks, a generator G and a discriminator. The generator receives three adjacent slices as input and reconstructs the next three adjacent slices. It involves a UNet-like architecture with four 4 × 4 convolution layers in the encoder, and four 4 × 4 deconvolution layers (DeConv with stride = 2) in the decoder with same-level skip connections, and two dropout layers of 0.5. BN is applied to the convolutional and deconvolutional layers with Leaky ReLU and ReLU activation functions. The discriminator receives both the generated output and the ground-truth slice-stack. It uses 3 decoders. Given the training size, we used 1650 training steps with a batch size of 8, the Adam optimizer with a learning rate of 2 × 10^-4 and the WGAN-GP+100L1 loss function. The WGAN-GP is an advanced version of WGAN, and uses the gradient penalty for regularization; this increases training stability and prevents mode collapse [64]. We also employ the L1 loss as it facilitates a sharper reconstruction [65]. This WGAN-GP+100L1 loss function enables the synthesis of counterparts structurally similar to the ground-truth slices.
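As a quick check on the stated encoder dimensions, the spatial size after each pooling stage can be computed directly. The sketch below assumes each max-pooling layer halves height and width (2 × 2 pooling with stride 2), which the text does not state explicitly but which is consistent with the reduction from 256 × 176 to 16 × 11 over four pooling layers. The function name `encoder_shapes` is hypothetical.

```python
def encoder_shapes(height=256, width=176, n_pools=4):
    """Return the (height, width) of the feature map after each pooling stage."""
    shapes = [(height, width)]
    for _ in range(n_pools):
        # Assumed: each max-pooling layer halves both spatial dimensions.
        height, width = height // 2, width // 2
        shapes.append((height, width))
    return shapes

shapes = encoder_shapes()
print(shapes)  # [(256, 176), (128, 88), (64, 44), (32, 22), (16, 11)]
```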
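For reference, one common way to write a WGAN-GP objective with an added L1 term weighted by 100 is given below. This follows the standard WGAN-GP formulation and is only a sketch of what WGAN-GP+100L1 may denote here; the symbols are assumptions introduced for illustration: x is the input three-slice stack, y the ground-truth next stack, ŷ a random interpolate between real and generated stacks, and λ a gradient-penalty coefficient whose value is not given in the text. Whether the discriminator is additionally conditioned on x is also not specified.

```latex
\begin{aligned}
\mathcal{L}_D &= \mathbb{E}\big[D(G(x))\big] - \mathbb{E}\big[D(y)\big]
  + \lambda\,\mathbb{E}\Big[\big(\lVert \nabla_{\hat{y}} D(\hat{y}) \rVert_2 - 1\big)^2\Big],
  \qquad \hat{y} = \epsilon\,y + (1-\epsilon)\,G(x),\ \ \epsilon \sim U[0,1],\\
\mathcal{L}_G &= -\,\mathbb{E}\big[D(G(x))\big] + 100\,\lVert y - G(x) \rVert_1 .
\end{aligned}
```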

SAGAN33 architecture
SAGAN33 is a GAN33 with self-attention (SA) modules added as shown in Figure 3. The SA module [33] is shown in Figure 4. Three 1 × 1 convolutions are used to transform the feature maps acquired from the previous convolution layer. The SA mechanism is applied over the feature maps obtained from the transformations f, g and h. This ensures that distant image parts are compatible with each other, unlike normal GANs [33]. Long-range dependencies among the image regions are established via the SA mechanism. The local and global image dependencies are combined to enhance the details and quality of the reconstructed images. Seven SA modules are included in the SAGAN: five in the generator, and two in the discriminator. These layers are complementary to the convolutional layers, and allow the network to capture finer information. The output size of the SA modules is identical to the input [66]. Hyperparameters for SAGAN33 were set identical to GAN33.
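The SA computation can be illustrated with a small NumPy sketch. It is a simplified version of the module in [33], not the SAGAN33 implementation: a 1 × 1 convolution is written as a per-pixel matrix product, the weight matrices `Wf`, `Wg` and `Wh` are random placeholders, the reduction of f and g to C/8 channels follows common practice rather than the text, and the final output projection used in some implementations is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feat, Wf, Wg, Wh, gamma=0.0):
    """feat: (H, W, C) feature map. Returns an output of the same shape
    and the (H*W, H*W) attention map."""
    H, W, C = feat.shape
    x = feat.reshape(H * W, C)                 # one row per spatial position
    f, g, h = x @ Wf, x @ Wg, x @ Wh           # 1x1 convs as per-pixel linear maps
    attn = softmax(g @ f.T, axis=-1)           # row j: weights over all positions i
    out = attn @ h                             # aggregate features from all positions
    y = gamma * out + x                        # residual; gamma is a learnable scalar
    return y.reshape(H, W, C), attn

rng = np.random.default_rng(0)
C = 16
feat = rng.standard_normal((4, 3, C))
Wf = rng.standard_normal((C, C // 8))
Wg = rng.standard_normal((C, C // 8))
Wh = rng.standard_normal((C, C))
out, attn = self_attention(feat, Wf, Wg, Wh, gamma=0.5)
print(out.shape, attn.shape)
```

Because every output position is a weighted sum over all input positions, the module can relate distant regions that a local convolution kernel would not cover, while keeping the output size identical to the input.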

Performance evaluation
We assume significant sMRI structural differences between ASD and healthy subjects [39], which should be reflected in dissimilarities between reconstructed healthy and ASD slices. We examined the utility of the a) L2 and b) cosine loss functions for threshold-based outlier detection. The L2 loss is defined as in Eqn. (1), and ranges from 0 to ∞. The cosine similarity loss or distance [67] is computed for a pair of vectorized slice stacks (X, Y) as shown in Eqn. (2). The cosine loss enforces similarity between the generated and actual slices [68,69], and ranges from 0 (for identical) to 1 (for highly dissimilar) slice stacks.
For classification, a threshold value (τ_avg) is computed from the training samples as in Eqn. (3), where n denotes the number of adjacent three-slice combinations per scan, Loss is the reconstruction loss between predicted and actual slices, and N denotes the number of subjects in the training set. Test samples are classified based on the threshold value, i.e., if the reconstruction loss for the test sample is less than τ_avg, it is marked as healthy, or else as ASD. Two alternative thresholds are shown in Equations (4) and (5). These are based on the maximum (or minimum) of the maximum (or minimum) reconstruction Loss per subject.
We use τ_avg as the threshold metric in our experiments, as this threshold reduces the number of false positives and false negatives, ensuring high sensitivity and specificity [70]. For performance evaluation, we use model accuracy defined as Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN respectively denote the number of True Positives, True Negatives, False Positives and False Negatives. TP and TN represent correctly classified ASD and healthy samples, whereas FP and FN denote incorrect predictions. We additionally report the area under the receiver operating characteristic curve (AUC) for evaluation.
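The two distance measures can be sketched in NumPy as below. Since Eqns. (1) and (2) are not reproduced here, the exact normalization is an assumption: the L2 loss is taken as the mean squared difference over all pixels, and the cosine loss as one minus the cosine similarity of the flattened stacks. The function names and the small `eps` guard against division by zero are illustrative.

```python
import numpy as np

def l2_loss(x, y):
    """Mean squared difference between two slice stacks (assumed form)."""
    return float(np.mean((x - y) ** 2))

def cosine_loss(x, y, eps=1e-12):
    """One minus the cosine similarity of the vectorized slice stacks."""
    x, y = x.ravel(), y.ravel()
    return float(1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

a = np.array([[1.0, 0.0], [0.0, 0.0]])
b = np.array([[0.0, 1.0], [0.0, 0.0]])
print(l2_loss(a, a), cosine_loss(a, a))   # both (near) zero for identical stacks
print(l2_loss(a, b), cosine_loss(a, b))   # cosine loss is 1 for orthogonal stacks
```

For non-negative image intensities the cosine similarity cannot be negative, which is why the cosine loss stays within the 0 to 1 range described above.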
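The thresholding and evaluation steps can be sketched as follows. The form of τ_avg used here, the mean over training subjects of each subject's mean stack-wise loss, is an assumption based on the symbol definitions above rather than a reproduction of Eqn. (3); the function names and the toy loss values are purely illustrative.

```python
import numpy as np

def tau_avg(train_losses):
    """Assumed form of the threshold: average over subjects of the
    per-subject mean reconstruction loss (healthy training scans only)."""
    return float(np.mean([np.mean(losses) for losses in train_losses]))

def classify(test_losses, tau):
    """0 = healthy (mean loss below tau), 1 = ASD (otherwise)."""
    return np.array([0 if np.mean(losses) < tau else 1 for losses in test_losses])

def accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # ASD predicted as ASD
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))   # healthy predicted as healthy
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return (tp + tn) / (tp + tn + fp + fn)

# Toy per-subject losses, one value per three-slice stack.
train = [np.array([0.10, 0.20]), np.array([0.30, 0.20])]
test = [np.array([0.10, 0.10]), np.array([0.50, 0.40]), np.array([0.30, 0.30])]
labels = [0, 1, 0]

tau = tau_avg(train)
pred = classify(test, tau)
acc = accuracy(labels, pred)
print(tau, pred, acc)
```

In this toy case the third subject is healthy but has a mean loss above the threshold, so it is counted as a false positive.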

Results and discussion
sMRI-based ASD detection results obtained with the UNet33, GAN33 and SAGAN33 architectures on the Axial, Coronal and Sagittal slices are presented in this section. In all three architectures, three adjacent slices are input to reconstruct the next three adjacent slices, and the reconstruction error is minimized during model training.

Slice reconstruction Quality
Exemplar reconstructions achieved with UNet33, GAN33 and SAGAN33 for adjacent slices corresponding to the Axial modality are shown in Figures 5 and 6. Fig. 5 depicts a healthy test sample, while Fig. 6 presents an ASD sample. The first column in both figures depicts the input slices, and the second column presents the actual next three slices. Columns 3-5 present reconstructions with the UNet33, GAN33 and SAGAN33, and the values in parentheses specify the reconstruction Peak Signal-to-Noise Ratio (PSNR). In both cases, SAGAN33 achieves the highest PSNR and captures vivid details compared to UNet33 and GAN33. UNet33 trained with the L2 loss function reconstructs blurry images. GAN33 reconstructs images with good structural quality, while still being inferior to SAGAN33. In UNet33 and GAN33, convolutions are limited to the local domain of the convolution kernels, causing the network to overlook significant global structures. However, the SA mechanism in the SAGAN33 effectively captures global dependencies. Sample Coronal and Sagittal ASD slice reconstructions achieved by SAGAN33 are presented in Figures 7 and 8 respectively. Evidently, the SAGAN33 adequately captures sMRI connectives. Ventricles of ASD subjects are larger and thicker than those of healthy subjects, as seen from Figures 5 and 6. This observation is echoed by domain experts [37]. Clearly, visual cues in ventricular areas allow clinicians to distinguish ASD from healthy subjects, and likewise, can enable sMRI-based ASD diagnosis.

ASD detection
Given a test sMRI slice-stack, the next three slices are reconstructed via the encoder-decoder networks described above, and the reconstruction loss compared against the threshold specified in Eqn. (3). This threshold determines whether the sample is a healthy (inlier for which mean reconstruction loss < τ avg ) or ASD (outlier, mean reconstruction loss > τ avg ) sample, enabling unsupervised ASD detection.
We employed different metrics to train the encoder-decoder networks described above, and to compute the distance between the original and reconstructed test slices. Table 4 specifies the loss function employed for model training as the objective metric, while the measure used to compute the distance between actual and reconstructed slices is specified as the distance metric. The L2 and cosine distance metrics defined in Eqn. 1 and 2 were used for evaluating test samples. For model training, the UNet33 was trained with the L2 objective, while the GAN33 and SAGAN33 models were trained with the WGAN-GP+100L1 or WGAN-GP+100L1+Cosine loss objectives.
Accuracy and AUC scores achieved by the UNet33, GAN33 and SAGAN33 networks with the different objective and distance metrics are listed in Table 4. When the L2 distance metric is used for classification, the UNet33 model performs worst, achieving an accuracy of only 36.95% owing to its poor reconstruction quality. Without SA modules, the GAN33 network reconstructs slices with adequate structural detail and produces a fair accuracy of 65.21%. The SAGAN33, which incorporates self-attention modules, achieves the best reconstruction quality, and correspondingly the best ASD detection accuracy of 80.43%. Overall, the SAGAN33 outperforms GAN33 by over 15 percentage points.
We also employed the cosine metric, and L2+Cosine measure as the distance metric with the SAGAN33 model. Table 4 confirms that the use of alternate distance metrics improves ASD detection accuracy. The cosine distance metric is more sensitive to outliers, and improves detection accuracy by over 2%. A combination of the L2 and cosine metrics further improves detection performance, achieving an accuracy of 84.78% and an AUC of 0.63. Finally, a SAGAN33 model incorporating cosine distance in the objective metric, along with the L2+cosine distance metric achieves the highest accuracy of 86.95% and an AUC of 0.71. Also, while the small size of our dataset can make the models prone to overfitting, the GAN33 and SAGAN33 networks effectively address this issue via regularization applied in the objective. Table 5 extends the results in Table 4, and presents the confusion matrix values for different objective-distance metric combinations employed with the SAGAN33 network. Given that the test set mainly comprised ASD samples (Table 1), we note that the sensitivity or true-positive rate gradually increases as the distance metric changes from L2 to L2+cosine loss. The true-negative rate (or specificity) also increases slightly when the objective metric is modified to include the cosine loss. Cumulatively, these results convey that both the objective and distance metrics impact ASD detection sensitivity and specificity.
We also note here that the longitudinal sMRI scans used in this study are heterogeneous, and were collected with different scanner settings (from different sites). Empirical results reveal that the proposed approach is robust to input data variations, and can be used in real-world situations where it is practically difficult to standardise scanning setups.

sMRI Imaging Modalities
To examine whether the detection performance is impacted by the sMRI imaging modality, the best performing model, SAGAN33, was fed with Axial, Coronal and Sagittal slices. Results are reported in Table 6, and the corresponding Receiver Operating Characteristic (ROC) curves are plotted in Figure 9. The model objective was to minimize the WGAN-GP + 100L1 + Cosine loss, while the distance metric employed at test time was the L2 + Cosine loss. Among individual models, the Sagittal slices performed worst and the Coronal slices best, achieving 20.4% higher accuracy and 0.17 higher AUC than the Sagittal slices. Utilizing multimodal information for ASD detection was found to be more beneficial than unimodal slices. Higher detection accuracy was obtained on combining the Axial and Sagittal slices, and the best accuracy/AUC was achieved with a combination of the Axial and Coronal slices. Training the SAGAN33 with slices from all three imaging modalities, however, did not enhance the overall accuracy or AUC.

GRAD-CAM Visualization
To understand the features learned by the SAGAN model for encoding sMRI visual cues, we present two different visualization maps in Figures 10 and 11. We used the Python package ELI5 to visualise Gradient-weighted Class Activation Maps (Grad-CAM) [71]. Grad-CAM is used to create class-specific heatmap visualizations that highlight the salient regions in the sMRI slices. Green or blue shades in the heatmap represent lesser importance, implying that the corresponding features are less significant from the model viewpoint, while the yellow, orange and red shades represent regions of moderate-to-high importance, implying that those features are attended to by the model in order to encode information or make inferences regarding the specific class. Figure 10 visualizes outputs of the first SAGAN33 convolution layer for an exemplar healthy and ASD subject. Some visual differences can be noted from the class-specific heatmaps; given that the SAGAN33 is trained with healthy samples, very little attention can be noted on the ventricular regions, which are key areas characterizing ASD subjects (see Figure 6). Visualizations from the fourth SA layer for three adjacent Coronal slices of a healthy vs ASD subject are presented in Figure 11. We can see that the model emphasizes the hypothalamus, hippocampus, and amygdala, which are considered to be significant for ASD diagnosis [72]. The high-intensity regions reflect areas of interest to the model at prediction time.
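The core Grad-CAM computation behind such heatmaps can be sketched in NumPy, given a layer's activations and the gradients of a target score with respect to them. This illustrates the standard Grad-CAM recipe only and is not the ELI5 code path used in the study; the function name `grad_cam` is hypothetical, the inputs are synthetic, and the final rescaling to [0, 1] is a common display convention assumed here.

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (H, W, K) arrays for one layer, where
    gradients holds d(score)/d(activations) for the target score."""
    weights = gradients.mean(axis=(0, 1))            # one weight per feature map
    cam = (activations * weights).sum(axis=-1)       # weighted sum over the K maps
    cam = np.maximum(cam, 0.0)                       # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                        # rescale to [0, 1] for display
    return cam

rng = np.random.default_rng(0)
acts = rng.random((8, 6, 4))
grads = rng.standard_normal((8, 6, 4))
heatmap = grad_cam(acts, grads)
print(heatmap.shape, float(heatmap.min()), float(heatmap.max()))
```

The resulting map has the spatial size of the chosen layer and is typically upsampled to the slice resolution before being overlaid with a colour map.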

Comparison with other works
We compare our work with others which employ ABIDE I [22] cross-sectional sMRI data to highlight the utility of longitudinal slices for ASD detection. Results are summarized in Table 7. Most baselines [52], [73], [74], [75] employ brain region-specific features for model training. Outlier-based ASD detection similar to ours, via single sMRI slice reconstruction, is proposed in [39]. In contrast, we (a) utilized the longitudinal ABIDE II data [23], and (b) learned holistic structural connectives by reconstructing three-slice stacks in this study. Among baselines, a highest accuracy of 96.6% is achieved by [39], while our approach produces the second highest accuracy of 95.65%. However, our model is trained with 20 times less data than in [39]; these results point to the effectiveness of employing multiple scans per subject acquired at different time-points for ASD diagnosis. To motivate the utility of longitudinal sMRI data, we applied our best performing SAGAN33 model on cross-sectional sMRI scans corresponding to the Axial, Coronal and Sagittal planes. We randomly collected cross-sectional sMRI samples from ABIDE I [22], so that the total number of train and test samples equalled the longitudinal data size in this study. The results obtained are summarized in Table 8. When compared with Table 6, rows 1-3, we see that detection accuracies with longitudinal data are superior by 17-28%, even though the accuracy/AUC trends are consistent across the different modalities. With cross-sectional data, the Coronal modality performs best, achieving an accuracy of 63% and an AUC of 0.64. Overall, these results support our rationale to perform ASD detection with longitudinal instead of cross-sectional data.

Conclusion and future work
We employ a GAN-based encoder-decoder framework on longitudinal sMRI slices, where the error between the reconstructed and actual adjacent three-slice stacks is utilized to determine ASD samples as outliers. Three architectures, namely, the UNet, GAN and SAGAN, were examined for reconstruction quality and, in turn, ASD detection performance. The SAGAN incorporating self-attention modules achieves the best reconstruction and detection accuracy, while the UNet trained with the L2 objective produces blurry reconstructions and the worst performance. Furthermore, both the objective metric employed for model training and the distance metric used for computing the reconstruction loss are found to significantly impact detection performance. The WGAN-GP+100L1 objective considerably improves the performance of the GAN and SAGAN networks, while employing the cosine similarity instead of, or in combination with, the L2 norm as the distance metric also increases detection sensitivity. Among other findings, of the three sMRI imaging planes (Axial, Coronal and Sagittal), the Coronal mode yielded the highest accuracy, outperforming the Sagittal mode by around 20%. This implies that sMRI structural connectivity is best encoded by Coronal information.
Empirical results also revealed that the imaging modes are complementary; multimodal inputs improved accuracies over unimodal data by more than 5%. Grad-CAM visualizations depicting regions-of-interest to the network showed attention to the hypothalamus, hippocampus, and amygdala regions, considered important for ASD diagnosis [72]. Comparisons against ASD detection works examining cross-sectional data convey that longitudinal sMRI slices enable comparable performance with far fewer training data, and modeling structural brain connectivity with multiple scans over time per subject is beneficial. Our unsupervised outlier detection framework would detect any deviation from the norm; apart from ASD, our framework could be extended to potentially tackle other disorders such as Attention deficit hyperactivity disorder (ADHD), Schizophrenia, etc. Future work will focus on these extensions. Another interesting line of exploration would be to utilise multi-task learning for exploiting complementarities in (substantially available) cross-sectional and (sparse) longitudinal data for improving prediction accuracy as in [76]. We will also investigate architectures alternative to SAGAN such as Dense-Attentive GAN [77].