A Machine Learning Framework for Early-Stage Detection of Autism Spectrum Disorders

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder that affects the everyday life of affected patients. Though ASD cannot be completely cured, its severity can be mitigated through early intervention. In this paper, we propose an effective framework for the evaluation of various Machine Learning (ML) techniques for the early detection of ASD. The proposed framework employs four different Feature Scaling (FS) strategies, i.e., Quantile Transformer (QT), Power Transformer (PT), Normalizer, and Max Abs Scaler (MAS). Then, the feature-scaled datasets are classified through eight simple but effective ML algorithms: Ada Boost (AB), Random Forest (RF), Decision Tree (DT), K-Nearest Neighbors (KNN), Gaussian Naïve Bayes (GNB), Logistic Regression (LR), Support Vector Machine (SVM), and Linear Discriminant Analysis (LDA). Our experiments are performed on four standard ASD datasets (Toddlers, Adolescents, Children, and Adults). Comparing the classification outcomes using various statistical evaluation measures (Accuracy, Receiver Operating Characteristic (ROC) curve, F1-score, Precision, Recall, Matthews Correlation Coefficient (MCC), Kappa score, and Log loss), the best-performing classification methods and the best FS techniques for each ASD dataset are identified. After analyzing the experimental outcomes of different classifiers on the feature-scaled ASD datasets, it is found that AB predicts ASD with the highest accuracies of 99.25% and 97.95% for the Toddlers and Children datasets, respectively, while LDA predicts ASD with the highest accuracies of 97.12% and 99.03% for the Adolescents and Adults datasets, respectively. These highest accuracies are achieved when scaling the Toddlers and Children datasets with the Normalizer FS method and the Adolescents and Adults datasets with the QT FS method. Afterward, the ASD risk factors are calculated, and the most important attributes are ranked according to their importance values using four different Feature Selection Techniques (FSTs), i.e., Info Gain Attribute Evaluator (IGAE), Gain Ratio Attribute Evaluator (GRAE), Relief F Attribute Evaluator (RFAE), and Correlation Attribute Evaluator (CAE). These detailed experimental evaluations indicate that proper fine-tuning of ML methods can play an essential role in predicting ASD in people of different ages. We argue that the detailed feature importance analysis in this paper will guide the decision-making of healthcare practitioners while screening ASD cases. The proposed framework achieves promising results compared to existing approaches for the early detection of ASD.


I. INTRODUCTION
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition associated with brain development that begins in the early stages of life, impairing a person's social relationships and interactions [1], [2]. ASD involves restricted and repetitive behavioral patterns, and the word 'spectrum' encompasses a wide range of symptoms and intensities [3], [4], [5]. Even though there is no permanent cure for ASD, early intervention and proper medical care can make a significant difference in a child's development by improving behaviors and communication skills [6], [7], [8]. Even so, the identification and diagnosis of ASD using traditional behavioral science are difficult and sophisticated. Autism is most commonly diagnosed at about two years of age but can also be diagnosed later, depending on its severity [9], [10], [11]. A variety of screening strategies are available to detect ASD as early as possible; however, these diagnostic procedures are not always applied in practice until there is a serious suspicion of ASD. The authors in [12] provided a short and observable checklist of traits that can be seen at different stages of a person's life, including toddlers, children, teens, and adults. Subsequently, the authors in [13] constructed the ASDTests mobile app for identifying ASD as early as possible, based on a range of questionnaire surveys, namely the Q-CHAT and AQ-10 methods. They also created open-source datasets utilizing the mobile app's information and submitted them to the publicly accessible University of California, Irvine (UCI) Machine Learning Repository and Kaggle for further development in this area of study.
Over the past few years, several studies have been conducted incorporating various Machine Learning (ML) approaches to analyze and diagnose ASD, as well as other diseases such as diabetes, stroke, and heart failure, as quickly as possible [14], [15], [16]. The authors in [17] analyzed ASD attributes utilizing Rule-based ML (RML) techniques and confirmed that RML helps classification models boost classification accuracy. The authors in [18] combined the Random Forest (RF) and Iterative Dichotomiser 3 (ID3) algorithms and produced predictive models for children, adolescents, and adults. The authors in [19] introduced a new evaluation tool, integrating ADI-R and ADOS with ML methods, and implemented different attribute encoding approaches to resolve data insufficiency, non-linearity, and inconsistency issues. Another study conducted by the authors in [13] computed feature-to-class and feature-to-feature correlation values utilizing cognitive computing and implemented Support Vector Machines (SVM), Decision Tree (DT), and Logistic Regression (LR) as ASD diagnostic and prognostic classifiers [17]. In addition, the authors in [20] explored typically developing (TD) (N = 19) and ASD (N = 11) cases, in which a correlation-based attribute selection was used to determine the importance of the attributes. In 2015, the authors in [21] investigated ASD and TD children and recognized 15 preschool ASD cases using only seven features. Besides that, they conveyed that cluster analysis might effectively analyze complex patterns to predict ASD phenotype and diversity. The authors in [22] contrasted the classification accuracy of K-Nearest Neighbors (KNN), LR, Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), Naive Bayes (NB), and SVM for adult ASD prediction.
In [23], an ML model based on rule induction was proposed for autism detection, but it was tested on only one dataset with limited comparison. The authors in [17] used LR analysis to build an ML autism classification approach, which likewise lacks extensive validation and comparison. The authors in [24] scrutinized autism data and observed that 5 of the overall 65 characteristics are sufficient to distinguish ASD from attention deficit hyperactivity disorder (ADHD). In 2019, the authors in [25] constructed an RF-based model for the prediction of ASD utilizing behavioral features. In addition, the authors in [26] used LDA and KNN methods to identify ASD in children between the ages of 4 and 11 years. In 2018, the authors in [27] suggested an ASD model based on the RF classifier for children between the ages of 4 and 11. The authors in [28] evaluated the predictive performance of a Deep Neural Network (DNN) in the diagnosis of ASD utilizing two distinct Adult datasets. In 2019, the authors in [18] constructed a smartphone application programming interface based on RF-CART and RF-ID3 for the diagnosis of ASD at all ages. The authors in [29] assessed the performance of multiple SVM kernels in classifying ASD data for children and found that the polynomial kernel performed best. The authors in [1] performed several feature selection techniques on four ASD datasets and found that the SVM classifier performed better for the RIPPER-based Toddler subset, the correlation-based feature selection (CFS) and Boruta CFS intersect (BIC) method-based Child subsets, and the CFS-based Adult subset. Furthermore, they applied the Shapley Additive Explanations (SHAP) method to the feature subsets that achieved the highest accuracy and ranked their features based on performance. The authors in [30] carried out ensemble ML approaches of Fuzzy K-Nearest Neighbor (FKNN), Kernel Support Vector Machines (KSVM), Fuzzy Convolutional Neural Network (FCNN), and Random Forest (RF) to classify Parkinson's disease and ASD; the classification results were verified utilizing Leave-One-Person-Out Cross Validation (LOPOCV). The authors in [31] applied an evolutionary cultural optimization algorithm to optimize the weights of Artificial Neural Networks (ANN) in classifying three benchmark autism screening datasets: Toddlers, Children, and Adults. The authors in [32] performed an experimental analysis using 16 different ML models; among them, four bio-inspired algorithms, namely Gray Wolf Optimization (GWO), the Flower Pollination Algorithm (FPA), the Bat Algorithm (BA), and Artificial Bee Colony (ABC), were employed to optimize the wrapper feature selection method in order to select the most informative features and increase the accuracy of the classification models on genetic and personal characteristics datasets. Another study conducted by the authors in [33] combined three benchmark datasets (Toddlers, Adolescents, and Adults) and applied a Light Gradient Boosting Machine (LGBM) classifier to classify ASD. The authors in [34] utilized Extreme Learning Machines (ELM) and Random Vector Functional Link (RVFL) generalization techniques to classify the Toddlers, Adolescents, and Adults datasets.
This study gathers four standard ASD datasets (Toddlers, Children, Adolescents, and Adults) and initially preprocesses them (manipulation of missing values and encoding). Then, four Feature Scaling (FS) methods, including Quantile Transformer (QT), Power Transformer (PT), Normalizer, and Max Abs Scaler (MAS), are undertaken to map the datasets into an appropriate format for further assessment. Thereafter, the feature-scaled datasets are classified by eight simple but effective classification approaches (AB, RF, DT, KNN, Gaussian Naive Bayes (GNB), LR, SVM, and LDA), and the best classification models are identified. Meanwhile, we also explore the significance of the FS methods on each dataset by analyzing the experimental outcomes of the transformed datasets. Afterward, four Feature Selection Techniques (FSTs), i.e., Info Gain Attribute Evaluator (IGAE), Gain Ratio Attribute Evaluator (GRAE), Relief F Attribute Evaluator (RFAE), and Correlation Attribute Evaluator (CAE), are implemented to calculate the risk factors of ASD and rank the most important features of the feature-scaled Toddlers, Children, Adolescents, and Adults datasets. Accordingly, this study suggests that ML methods, combined with the FST-based feature importance analysis, can help identify the most significant features for ASD detection and thereby help physicians diagnose ASD cases accurately. Notice that the work presented in [35] may seem somewhat similar to ours. However, the notable differences are as follows. (i) We consider four promising FS methods (QT, PT, Normalizer, and MAS), whereas the three FS methods (Logarithmic, Z-Score, and Sine) used in [35] are obsolete nowadays. (ii) After applying each FS method, we find the best FST from a list of IGAE, GRAE, RFAE, and CAE for each dataset to train the ML models, whereas [35] did not consider any such tuning of the FST methods. (iii) We consider eight simple but effective ML models for the prediction, whereas the ML models used in [35] are archaic in this domain. (iv) Finally, we compare more recent works with our proposed model, in contrast to [35]. To this end, the key contributions of this paper are summarized as follows.
• We develop a generalized ML framework for early-stage detection of ASD in people of different ages.
• We solve the imbalanced class distribution issue through Random Over Sampler to avoid the ML models being biased towards the majority class samples.
• We select the best Feature Scaling (FS) method to map individual ASD dataset's feature values to improve the prediction performance.
• We investigate eight simple but effective ML approaches on each feature-scaled ASD dataset, analyze their classification performances and identify the best FS techniques for each ASD dataset.
• Furthermore, we also calculate and analyze the feature importance values on each best feature-scaled ASD dataset based on four FSTs to identify the risk factors for ASD prediction.
• Finally, we also perform extensive experiments and comparisons using four different standard ASD datasets.

The remainder of the paper is organized as follows. Section 2 describes the proposed research methodology and the materials used in the study. Section 3 analyzes the detailed experimental outcomes, while Section 4 discusses the comparative results against progressive works in this domain. Finally, Section 5 summarizes and concludes the observations and findings.

A. DATASET DESCRIPTION
We collect the four ASD datasets (Toddlers, Adolescents, Children, and Adults) from the publicly available repositories Kaggle and the UCI ML repository [36], [37], [38], [39]. The authors in [13] created the ASDTests smartphone app for Toddlers, Children, Adolescents, and Adults ASD screening using Q-CHAT-10 and AQ-10. The application computes a score from 0 to 10 for every individual, and a final score greater than 6 out of 10 indicates a positive ASD screening. The ASD data obtained from the ASDTests app was released as open-source datasets in order to facilitate research in this area. The detailed description of the Toddlers, Children, Adolescents, and Adults ASD datasets is given in Table 1 and Table 2.

B. METHOD OVERVIEW
This research aims to create an effective prediction model using different types of ML methods to detect autism in people of different ages. First, the datasets are collected, and the preprocessing is accomplished via missing value imputation, feature encoding, and oversampling. The Mean Value Imputation (MVI) method is used to impute the missing values of the datasets. Then, the categorical feature values are converted to their equivalent numerical values using the One Hot Encoding (OHE) technique. Table 1 shows that all four datasets used in this work have an imbalanced class distribution problem; as such, a Random Over Sampler strategy is used to alleviate this issue. After completing the initial preprocessing, the datasets' feature values are scaled using four different FS techniques, i.e., QT, PT, Normalizer, and MAS (see their detailed operations in Table 3). The feature-scaled datasets are then classified using eight different ML classification techniques, i.e., AB, RF, DT, KNN, GNB, LR, SVM, and LDA. Comparing the classification outcomes of the classifiers on the different feature-scaled ASD datasets, the best-performing classification methods and the best FS techniques for each ASD dataset are identified. After those analyses, the ASD risk factors are calculated, and the most important attributes are ranked according to their importance values using four different FSTs, i.e., IGAE, GRAE, RFAE, and CAE (see their detailed operations in Table 4). To this end, Fig. 1 represents the proposed research pipeline to analyze the ASD datasets and calculate the risk factors that are most responsible for ASD detection.
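For concreteness, the following is a minimal sketch of this preprocessing and scaling pipeline using scikit-learn and imbalanced-learn; the file name and target column name are illustrative assumptions, not the datasets' actual identifiers.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import QuantileTransformer  # also: PowerTransformer, Normalizer, MaxAbsScaler
from imblearn.over_sampling import RandomOverSampler

# Load one of the four ASD datasets (file and column names are hypothetical).
df = pd.read_csv("asd_toddlers.csv")
y = df.pop("Class/ASD")  # assumed name of the binary target column

# Mean Value Imputation (MVI) for the numeric attributes.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])

# One Hot Encoding (OHE) for the categorical attributes.
X = pd.get_dummies(df)

# Random Over Sampler to balance the class distribution.
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)

# One of the four FS methods (QT shown; the other three are drop-in replacements).
X_scaled = QuantileTransformer(output_distribution="normal").fit_transform(X_res)
```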
1) ADA BOOST (AB)
AB is a tree-based ensemble classifier that incorporates many weak classifiers to reduce misclassification errors [41]. It iteratively reweights the training samples depending on the previous round's training precision and retrains the weak learner; to train any weak classifier, an arbitrary subset of the full training set is used, and AB assigns weights to each instance and classifier. The following equation defines the combination of the weak classifiers:

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right),$$

where H(x) defines the output of the final model obtained by combining the weak classifiers, h_t(x) represents the output of classifier t for input x, and α_t specifies the weight assigned to that classifier. α_t is calculated as

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - E}{E}\right),$$

where E denotes the error rate. The following equation is utilized to update the weight of each training sample-label pair (x_i, y_i):

$$D_{t+1}(i) = \frac{D_t(i)\, \exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{Z_t},$$

where D_{t+1} denotes the updated weight, D_t specifies the weight at the previous level, and Z_t is the sum of all updated weights (a normalization factor).
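The following is a minimal sketch of a single boosting round implementing the equations above, assuming class labels encoded as −1/+1; in the actual experiments, scikit-learn's AdaBoostClassifier performs this weighting internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_round(X, y, D_t):
    """One boosting round; y is assumed to contain labels in {-1, +1}."""
    stump = DecisionTreeClassifier(max_depth=1)     # weak classifier h_t
    stump.fit(X, y, sample_weight=D_t)
    h = stump.predict(X)
    E = np.sum(D_t[h != y])                         # weighted error rate E
    alpha_t = 0.5 * np.log((1 - E) / (E + 1e-12))   # classifier weight
    D_next = D_t * np.exp(-alpha_t * y * h)         # re-weight sample-label pairs
    return stump, alpha_t, D_next / D_next.sum()    # divide by Z_t (normalization)
```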
2) RANDOM FOREST (RF)
RF is a decision tree-based ensemble classification method that follows a divide-and-conquer strategy on the input dataset to create multiple decision trees (known as the forest) [42]. It works in two phases: at first, it creates a forest by combining 'N' decision trees, and in the second phase, it makes predictions with each tree generated in the first phase. The working process of the RF algorithm is illustrated below (a short usage sketch follows the list):
1) Select random samples from the training dataset.
2) Construct a decision tree for each training sample.
3) Select the value of 'N' to define the number of decision trees.
4) Repeat Steps 1 and 2.
5) For each test sample, find the predictions of each decision tree, and assign the test sample a class value based on majority voting.
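As a usage illustration (reusing the X_scaled and y_res variables from the preprocessing sketch above), the forest size 'N' maps to scikit-learn's n_estimators parameter:

```python
from sklearn.ensemble import RandomForestClassifier

# 'N' in the procedure above corresponds to n_estimators.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_scaled, y_res)          # variables from the preprocessing sketch
rf_pred = rf.predict(X_scaled)   # each tree votes; the majority vote decides the class
```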

3) DECISION TREE (DT)
DT follows a top-down approach to build a predictive model for class values by inducing decision-making rules from the training data [43]. This research utilizes the information gain method to select the best attribute. Let P_i be the probability that an instance x_i ∈ D belongs to class C_i, estimated as |C_{i,D}|/|D|. The expected information required to classify an instance in dataset D is

$$\text{Info}(D) = -\sum_{i=1}^{m} P_i \log_2(P_i),$$

where Info(D) is the average amount of information needed to identify the class C_i of an instance x_i ∈ D, and the objective of DT is to repeatedly partition D into sub-datasets D_1, D_2, ..., D_n. The expected information after splitting D on attribute A is estimated as

$$\text{Info}_A(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|} \times \text{Info}(D_j).$$

Finally, the information gain value is calculated as

$$\text{Gain}(A) = \text{Info}(D) - \text{Info}_A(D).$$
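A small sketch of these information gain computations for a categorical attribute column (NumPy arrays assumed) could look as follows:

```python
import numpy as np

def entropy(y):
    """Info(D): average information needed to identify the class of an instance."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(x, y):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute column x."""
    info_a = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    return entropy(y) - info_a
```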

4) K-NEAREST NEIGHBORS (KNN)
KNN classifies test data directly from the training data, with the value of K indicating the number of nearest neighbors considered [43]. For each test instance, it computes the distance to all training instances and sorts the distances; a majority voting technique among the K closest neighbors is then employed to assign the final class label. This research applies the Euclidean distance to calculate the distances among instances:

$$D_e = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2},$$

where D_e indicates the Euclidean distance, X_i denotes the testing sample values, Y_i specifies the training sample values, and n represents the total number of sample values.
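The Euclidean distance of the equation above can be sketched directly in NumPy; scikit-learn's KNeighborsClassifier computes the same metric by default (Minkowski with p = 2):

```python
import numpy as np

def euclidean_distance(X_i, Y_i):
    """D_e between a test sample X_i and a training sample Y_i."""
    X_i, Y_i = np.asarray(X_i), np.asarray(Y_i)
    return np.sqrt(np.sum((X_i - Y_i) ** 2))
```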

5) GAUSSIAN NAÏVE BAYES (GNB)
The GNB algorithm assumes a normal distribution and is used for classification when all the data values of a dataset are numeric [43]. To compute the probability of any instance with respect to a class value, the mean and standard deviation are calculated for each attribute of the dataset. Consequently, when a test instance arrives, these statistics are used to calculate its probability:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \delta = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}, \qquad f(x) = \frac{1}{\delta\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\delta^2}},$$

where µ indicates the mean, δ represents the standard deviation, x_i denotes the samples in a particular column, n indicates the total number of samples, and f(x) presents the conditional probability given the class value.
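A sketch of the per-attribute Gaussian likelihood used by GNB, following the equations above:

```python
import numpy as np

def gaussian_likelihood(x, mu, delta):
    """f(x): class-conditional likelihood of value x given class mean mu and std delta."""
    return np.exp(-((x - mu) ** 2) / (2 * delta ** 2)) / (delta * np.sqrt(2 * np.pi))
```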

6) LOGISTIC REGRESSION (LR)
Based on a given dataset of independent variables, logistic regression calculates the likelihood that an event will occur, such as voting or not voting. Given that the result is a probability, the dependent variable's range is 0 to 1. In logistic regression, the odds, that is, the probability of success divided by the probability of failure, are transformed using the logit function, sometimes referred to as the log odds or the natural logarithm of odds [43]:

$$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_1 + \cdots + b_n x_n, \qquad p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_n x_n)}},$$

where p denotes the probability of instance x.
Now, the following equation is used to update the values of the coefficients:

$$b_j = b_j + l\,(y - p)\,p\,(1 - p)\,x_j,$$

where all the coefficient values are initially 0, y is the output value for each training sample, l denotes the learning rate, and x_0 represents the bias input for b_0 and is always 1. The coefficient values are updated until the model predicts the correct outputs at the training stage.
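A minimal sketch of this coefficient update for one training sample, assuming x includes the constant bias input x_0 = 1 (the learning rate value is illustrative):

```python
import numpy as np

def lr_update(b, x, y, l=0.3):
    """One stochastic update of the coefficients b; x[0] = 1 is the bias input for b_0."""
    p = 1.0 / (1.0 + np.exp(-np.dot(b, x)))  # predicted probability for this sample
    return b + l * (y - p) * p * (1 - p) * x
```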

7) SUPPORT VECTOR MACHINE (SVM)
SVM is used to classify both linear and non-linear data and generally works well for high-dimensional data with non-linear mapping. It explores the decision boundary, or optimal hyperplane, that separates one class from another. This study uses the Radial Basis Function (RBF) as the kernel function; with it, SVM automatically defines centers, weights, and thresholds and reduces an upper bound on the expected test error [29], [44].
The RBF kernel is defined as

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\delta^2}\right),$$

where ‖x − x′‖² defines the squared Euclidean distance between the two feature samples and δ is a free parameter.
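A sketch of this RBF kernel; note that scikit-learn's SVC parameterizes it as exp(−γ‖x − x′‖²), i.e., γ = 1/(2δ²):

```python
import numpy as np

def rbf_kernel(x, x_prime, delta=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 * delta^2)), with delta as the free parameter."""
    sq_dist = np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2)
    return np.exp(-sq_dist / (2.0 * delta ** 2))
```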

8) LINEAR DISCRIMINANT ANALYSIS (LDA)
LDA is a dimensionality reduction technique that can also be used for classification by exploring linear combinations of features [45]. LDA uses the Bayes theorem to estimate class probabilities. Consider k classes and n training samples {x_1, x_2, ..., x_n} with class labels z_i ∈ {1, ..., k}. The class-conditional distribution of each class is assumed to be Gaussian, φ(x | µ_k, Σ). The model parameters are estimated as

$$a_k = \frac{n_k}{n}, \qquad \mu_k = \frac{1}{n_k} \sum_{i:\, z_i = k} x_i, \qquad \Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{z_i})(x_i - \mu_{z_i})^{\top},$$

where a_k denotes the prior probability of class k, µ_k defines the mean of class k, and Σ indicates the pooled sample covariance shared across classes.
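As a usage illustration (again reusing X_scaled and y_res from the preprocessing sketch), scikit-learn's LDA estimates these Gaussian parameters with a covariance matrix shared across classes:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA with a shared covariance matrix across classes, as in the estimates above.
lda = LinearDiscriminantAnalysis()
lda.fit(X_scaled, y_res)                  # variables from the preprocessing sketch
posteriors = lda.predict_proba(X_scaled)  # Bayes-rule class posteriors
```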

III. EXPERIMENTAL RESULTS ANALYSIS

A. EXPERIMENTAL SETUP
In order to conduct the experiments, an open-source cloud-based service named Google Colaboratory, provided by Google, is utilized. The scikit-learn package of the Python programming language is used to complete the data preprocessing, feature scaling, feature selection, and classification tasks. In this work, a 10-fold cross-validation technique [46], [47], [48] is utilized to construct prediction models using the four ASD (Toddlers, Children, Adolescents, and Adults) datasets. In 10-fold cross-validation, the dataset is randomly divided into 10 equal folds; 9 folds are used for training and the remaining one for testing. This procedure is repeated 10 times, and the results are finally averaged. Here, due to the lack of enough samples in the datasets, 10-fold cross-validation is preferred over a single train-test split. The prediction models are evaluated with the following measures:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Kappa} = \frac{p_o - p_e}{1 - p_e},$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

$$\text{Log loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right].$$

The following terms represent the above equations: TP = True Positive; TN = True Negative; FP = False Positive; FN = False Negative; p_o is the relative observed agreement among raters; p_e is the hypothetical probability of chance agreement; y is the actual/true value; and ŷ is the prediction probability of each observation.
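A minimal sketch of this 10-fold evaluation with several of the above scorers, assuming the target is encoded as 0/1 so the binary scorers apply (the kappa score would need a custom scorer via make_scorer):

```python
from sklearn.model_selection import cross_validate
from sklearn.ensemble import AdaBoostClassifier

# X_scaled and y_res come from the preprocessing sketch above.
scores = cross_validate(
    AdaBoostClassifier(random_state=42),
    X_scaled, y_res, cv=10,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc",
             "matthews_corrcoef", "neg_log_loss"],
)
print("Mean accuracy:", scores["test_accuracy"].mean())
```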

B. ANALYSIS ON ACCURACY
Accuracy represents the actual prediction performance of any classifier. A higher accuracy value indicates better prediction and lower misclassification. The accuracy values of various classifiers on the different feature-scaled datasets are presented in Table 5.
While reviewing the accuracy results of the feature-scaled Toddler and Children datasets, it is observed that AB obtains the highest accuracies of 99.25% and 97.95%, respectively. For the Adolescent dataset, LDA delivers the best accuracy of 97.12% on the normalizer-scaled data. Moreover, while investigating the results of the feature-scaled Adult dataset, it is seen that both the QT and normalizer-scaled datasets perform better than the other FS methods; in both cases, LDA achieves the best accuracy value of 99.03%. Additionally, the accuracy values of various ML classifiers on the feature-scaled Toddlers, Children, Adolescents, and Adult datasets are contrasted in Fig. 2.

C. ANALYSIS ON PRECISION
Precision represents the proportion of predicted positive cases that are truly positive. The precision values of various ML classifiers on the different feature-scaled datasets are presented in Table 6. Analyzing the precision values of the Toddler dataset, it is found that the AB classifier provides the best precision of 99.95% while PT is used as the FS method. While reviewing the feature-scaled Children dataset, it is noticed that the LR classifier obtains the highest precision of 96.16% with MAS in classifying ASD. Furthermore, inspecting the feature-scaled Adolescent dataset, we observe that DT delivers the best precision of 97.25% while using PT as the FS method. Moreover, while investigating the results of the feature-scaled Adult dataset, it is seen that the QT-transformed dataset performs better than the other FS methods; in that case, SVM achieves the best precision value of 98.16%. Additionally, the precision values of various ML classifiers on the feature-scaled Toddlers, Children, Adolescents, and Adult datasets are contrasted in Fig. 3.

D. ANALYSIS ON RECALL
Recall represents the true positive rate; a higher recall value means the number of true positives is high and the number of false negatives is low, which indicates better prediction. The recall values of various ML classifiers on the different feature-scaled datasets are presented in Table 7. While reviewing the recall results of the feature-scaled Toddler dataset, it is observed that AB obtains the highest recall of 98.45% for the normalizer-scaled Toddler dataset. Investigating the feature-scaled Children datasets, we find that LR delivers the best recall value of 97.72% while using normalizer as the FS method. Moreover, inspecting the recall results of the feature-scaled Adolescent datasets, it is noticed that AB achieves the highest recall of 97.36% for the normalizer-scaled Adolescent datasets. Finally, we analyze the outcomes of the feature-scaled Adult datasets and find that RF, KNN, and LR deliver the highest recall of 100.00% for the PT-scaled data, DT and KNN also reach 100.00% for PT, and KNN and LR likewise obtain a 100.00% recall value for the MAS-scaled Adult datasets. Besides, we also compare the recall values of various ML classifiers on the feature-scaled Toddlers, Children, Adolescents, and Adult datasets in Fig. 4.

E. ANALYSIS ON ROC
The ROC value indicates the ability of any classifier to distinguish between the positive and negative classes. The ROC values of various ML classifiers on the different feature-scaled datasets are presented in Table 8. While reviewing the ROC results of the feature-scaled Toddler dataset, it is observed that LR obtains the highest ROC of 99.99% for both QT and PT, and AB achieves 99.99% for the normalizer method. Investigating the feature-scaled Children dataset, it is found that GNB delivers the best ROC value of 99.73% using normalizer as the FS method. Moreover, inspecting the ROC results of the feature-scaled Adolescent dataset, we notice that both AB and LDA achieve the highest ROC of 99.72% for the QT and MAS-scaled datasets. Finally, we analyze the outcomes of the feature-scaled Adult datasets and find that LDA delivers the highest ROC value of 99.99% while using PT and normalizer as the FS methods. We compare the ROC values of various ML classifiers on the feature-scaled Toddlers, Children, Adolescents, and Adult datasets in Fig. 5.

F. ANALYSIS ON F1-SCORE
The F1-score is the harmonic mean of precision and recall, balancing both measures in a single value. The F1-score values of various ML classifiers on the different feature-scaled datasets are presented in Table 9. While reviewing the F1-score results of the feature-scaled Toddler dataset, we observe that AB obtains the highest F1-score of 99.14% for the normalizer-scaled Toddler dataset. Investigating the feature-scaled Children dataset, it is found that AB delivers the best F1-score value of 97.02% while using QT and normalizer as the FS methods. Moreover, inspecting the F1-score results of the feature-scaled Adolescent datasets, we notice that AB achieves the highest F1-score of 97.69% for the QT-scaled Adolescent dataset. Finally, we analyze the outcomes of the feature-scaled Adult dataset and notice that LDA delivers the highest F1-score value of 99.11% while using PT as the FS method. We compare the F1-score values of various ML classifiers on the feature-scaled Toddlers, Children, Adolescents, and Adult datasets in Fig. 6.

G. ANALYSIS ON KAPPA
The Kappa score measures the degree of agreement between the true class and the predicted class. A higher kappa value means a better prediction, indicating a higher degree of agreement between the actual and predicted values. The kappa values of various ML classifiers on the different feature-scaled datasets are presented in Table 10. While reviewing the kappa results of the feature-scaled Toddler dataset, it is observed that both the normalizer and MAS-scaled datasets provide the best kappa value and outperform the other FS methods; both LR and LDA obtain the highest kappa of 99.31% for the normalizer and MAS-scaled Toddler datasets. Investigating the feature-scaled Children datasets, it is found that AB delivers the best kappa value of 93.78% using normalizer as the FS method. Moreover, inspecting the kappa results of the feature-scaled Adolescent datasets, we notice that LDA achieves the highest kappa of 94.02% for both the QT and PT-scaled datasets. Finally, we analyze the outcomes of the feature-scaled Adult datasets and see that both LR and LDA deliver the highest kappa value of 99.02% while using QT and normalizer as the feature scaling methods. Besides, we also compare the kappa values of various ML classifiers on the feature-scaled Toddlers, Children, Adolescents, and Adult datasets in Fig. 7.

H. ANALYSIS ON LOG LOSS
The log loss value indicates how close the prediction probability is to the true value; the lower the log loss value, the better the prediction. The log loss values of various ML classifiers on the different feature-scaled datasets are presented in Table 11. While reviewing the log loss results of the feature-scaled Toddler and Children datasets, we observe that AB obtains the lowest log loss of 0.0802% for the normalizer-scaled Toddler dataset and 0.98% for the QT and PT-scaled Children datasets. Furthermore, it is noticed that LDA achieves the lowest log loss of 1.12% for the QT, PT, and MAS-scaled Adolescent datasets. Finally, we analyze the outcomes of the feature-scaled Adult datasets and see that both LR and LDA deliver the lowest log loss value of 0.16% while using QT and normalizer as the feature scaling methods. Besides, we also compare the log loss values of various ML classifiers on the feature-scaled Toddlers, Children, Adolescents, and Adult datasets in Fig. 8.

I. ANALYSIS ON MCC
MCC takes all the cells of the confusion matrix (TP, TN, FN, and FP) into consideration to calculate the degree of correlation. A higher MCC value represents better prediction and a stronger correlation between the actual and predicted classes. While reviewing the MCC results of the feature-scaled Toddler dataset, we observe that both LR and LDA obtain the highest MCC of 99.31% for the normalizer and MAS-scaled Toddler datasets. Investigating the feature-scaled Children datasets, it is found that AB delivers the best MCC value of 93.88% using normalizer as the FS method. Moreover, inspecting the MCC results of the feature-scaled Adolescent datasets, we notice that LDA achieves the highest MCC of 94.25% for both the QT and PT-scaled datasets. Finally, we analyze the outcomes of the feature-scaled Adult datasets and find that both LR and LDA deliver the highest MCC value of 99.03% while using QT as the feature scaling method. Besides, we also compare the MCC values of various ML classifiers on the feature-scaled Toddlers, Children, Adolescents, and Adults datasets in Fig. 9.

IV. DISCUSSION AND EXTENDED COMPARISON
In the previous section, we analyzed four different ASD datasets to build prediction models for people at different life stages. In order to do this, we applied various FS methods to those ASD datasets, classified them utilizing eight different simple but effective ML classifiers, and determined how the FS methods affect the classification performance. Furthermore, we also employed four different FSTs to compute the importance of the features that are most responsible for ASD prediction. Inspecting the experimental findings, the best-performing classifiers predicted ASD with the highest accuracies of 99.25% (AB) for Toddlers, 97.95% (AB) for Children, 97.12% (LDA) for Adolescents, and 99.03% (LDA) for Adults, with LR and LDA achieving the lowest log loss of 0.16% on the Adults dataset. After analyzing the experimental outcomes of the different classifiers on the feature-scaled ASD datasets, it is found that AB for Toddlers and Children, and LDA for Adolescents and Adults, outperformed the other ML classifiers in terms of classification performance. Besides, the experimental outcomes implied that the normalizer FS method for Toddlers and Children and the QT FS method for Adolescents and Adults showed better performance. Additionally, we calculated the feature importance using the IGAE, GRAE, RFAE, and CAE FST methods on the normalizer-scaled Toddlers, normalizer-scaled Children, QT-scaled Adolescents, and QT-scaled Adults datasets to enumerate the risk factors for ASD prediction. The quantitative results are provided in Table 13, Table 14, Table 15, and Table 16. This feature importance analysis helps healthcare practitioners decide the most important features while screening ASD cases. To this end, we provide the comparative results of our work with other recent studies in Table 17.

V. CONCLUSION
In this work, we proposed a machine learning framework for ASD detection in people of different ages (Toddlers, Children, Adolescents, and Adults) and showed that predictive models based on ML techniques are useful tools for this task. After completing the initial data preprocessing, the ASD datasets were scaled using four different feature scaling techniques (QT, PT, Normalizer, and MAS) and classified using eight different ML classifiers (AB, RF, DT, KNN, GNB, LR, SVM, and LDA). We then analyzed each feature-scaled dataset's classification performance and identified the best-performing FS and classification approaches. We considered different statistical evaluation measures such as accuracy, ROC, F1-score, precision, recall, Matthews correlation coefficient (MCC), kappa score, and log loss to justify the experimental findings. Consequently, our proposed prediction models based on ML techniques can be utilized as an alternative, or even a helpful tool, for physicians to accurately identify ASD cases in people of different ages. Additionally, the feature importance values were calculated to identify the most prominent features for ASD prediction by employing four different FSTs (IGAE, GRAE, RFAE, and CAE). Therefore, the experimental analysis of this research will allow healthcare practitioners to take into account the most important features while screening ASD cases. The limitation of our research work is that the amount of data was not sufficient to build a generalized model for people of all age groups. In the future, we intend to collect more data related to ASD and construct a more generalized prediction model for people of any age to improve the detection of ASD and other neurodevelopmental disorders.
ANWAAR ULHAQ received the Ph.D. degree in artificial intelligence from Monash University, Australia. He is currently working as a Senior Lecturer (AI) with the School of Computing, Mathematics, and Engineering, Charles Sturt University, Australia. He has developed national and international recognition in computer vision and image processing. His research has been featured 16 times in national and international news venues, including ABC News and IFIP (UNESCO). He is an Active Member of IEEE, ACS, and the Australian Academy of Sciences. As the Deputy Leader of the Machine Vision and Digital Health Research Group (MaViDH), he provides leadership in artificial intelligence research and leverages his leadership vision and strategy to promote AI research by mentoring junior researchers in AI and supervising HDR students devising plans to increase research impact.
GOVIND KRISHNAMOORTHY is currently a Clinical Psychologist and a Senior Lecturer with the School of Psychology and Wellbeing, University of Southern Queensland, Australia. His research and clinical practice focus on improving mental health and educational outcomes for children and adolescents. He has collaborated with health services, schools, and community services in implementing place-based and systems approaches to support developmental disorders and mental health concerns in children, adolescents, and their families.