Machine learning approaches and applications in genome wide association study for Alzheimer’s Disease: A systematic review

Machine learning algorithms have been used for the detection, and potentially the prediction, of Alzheimer's disease using genotype information, with the potential to enhance outcome prediction. However, detailed research on the analysis and detection of Alzheimer's disease using genetic data is still at an early stage. The aim of this paper is to examine the scientific literature on the use of various machine learning approaches for the prediction of Alzheimer's disease based solely on genetic data; to identify gaps in the literature; to critically appraise the reporting and methods of the algorithms; and to provide the foundation for a wider research programme focused on developing novel machine learning based predictive algorithms for Alzheimer's disease. We reviewed articles published between January 1, 2010, and September 21, 2021, in PubMed, Web of Science, and Scopus, searching for keywords and phrases linked to Alzheimer's disease and machine learning tools, including artificial neural networks, boosting, and random forests. Articles were screened for inclusion, then retrieved and assessed for risk of bias using Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) criteria. A pool of 150 abstracts and 65 full texts was evaluated, and 24 studies were included in the review. Machine learning methods in the reviewed papers performed across a wide range (0.59 to 0.98 AUC). Our study indicated that the high risk of bias in the analyses can be linked to feature selection, hyperparameter search, and validation methods.


I. INTRODUCTION
One of the most significant scientific challenges in human genomics is the study of genetic variants connected to complex illnesses. The bulk of genome-wide association studies (GWAS) [1] attempt to identify genetic variations that may be connected to such illnesses. Single nucleotide polymorphisms (SNPs) are known to be the most prevalent genetic variations, with around 10 million SNPs in the human genome [2].
A SNP is a single nucleotide site at which a significant fraction of the population carries one of exactly two (of four possible) nucleotides. SNPs are known to play a significant role in complex disorders in two ways: first, by changing a protein's structure; second, by altering a protein's quantity. This is referred to as SNP functionality. Genotyping millions of SNPs is too costly; as a result, an appropriate subset of SNPs must be obtained that correctly represents the remaining SNPs.
Genetic association research attempts to explore genetic risk factors by identifying statistical connections between genotypes and phenotypes (disease of interest). The most popular method for determining the genetic connections of complicated disorders is to conduct case-control studies in unrelated individuals.
Machine learning (ML) is an alternative to established approaches for genetic prediction. Following developments in deep learning, it has grown in prominence in recent years [3], as have the scale of available datasets and computational capacity. In statistical genetics, where the effects of a large number of factors on an outcome are difficult to anticipate, such techniques are intriguing because of their capability to operate in high dimensions and identify relations across genes [4]. There have also been more calls to employ machine learning to handle the complexity of diseases such as Alzheimer's disease [5]. However, the accuracy of machine learning approaches in predicting Alzheimer's disease using genetics is still unclear, and a recent review of prediction models across a variety of outcomes and predictors found that logistic regression (LR) provided comparably high accuracy, so the added value of machine learning in this research field is questionable [6].
Various reviews have examined genome-wide association research and genetic prediction in relation to ML. For example, Bracher-Smith et al. [7] examined machine learning algorithms for identifying mental diseases based only on genetic information, while Madhukar et al. [8] discuss bioinformatics ideas for leveraging sequencing data to predict sample-specific medication susceptibility. Upstill-Goddard et al. [9] provided a review of machine learning methods in genetic epidemiology for detecting gene-gene interactions. The most essential machine learning approaches, and the circumstances that must be addressed when applying these algorithms to genomic challenges, were discussed in articles including [10]. The goal is to find and analyse GWAS concerns that require computational approaches instead of, or in addition to, biostatistical methods [11]. Data mining and machine learning computational methodologies, as well as bioinformatics methods for embedding pre-existing biological knowledge into data analysis algorithms, were the focus of other research carried out by Wu and Zhao [10]. A review of illness prediction based on single nucleotide polymorphisms has been undertaken by Ho et al. [12]. A recent systematic review of ML algorithms applied to SNP data for Alzheimer's disease (AD) was presented by Rowe et al. [13]. However, the main limitation of that paper is that the authors used machine learning as a generic keyword for their search and presentation instead of investigating specific ML techniques; hence, their review does not provide sufficient detail on the contribution of various research efforts in relation to the use of ML for the analysis of SNP data for AD. Their inclusion criteria also involved studies which combined SNP data with other forms of data. As far as we are aware, there have been no reviews of studies which have developed ML models to predict AD outcomes from SNP data specifically.
As a result, in our review we have conducted an extensive literature review on the ability of machine learning (ML) techniques to predict Alzheimer's disease risk from genetic data, based on genome-wide association studies. The goal of the review is to identify gaps in the literature, critically appraise the reporting and methods of the algorithms, and provide the foundation for a wider research programme focused on developing novel machine learning based predictive algorithms in AD. It should be noted that studies in which models were also tested on simulated datasets or on other chronic diseases alongside Alzheimer's disease are also considered in this review. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) criteria were followed in writing this review.

A. BIG DATA
Big data refers to collections of large, structured or unstructured data sets that typical database systems struggle to manage, and to the tools and procedures that enable an organisation to produce, utilise, and store huge volumes of data [14]. Big data is usually defined by five characteristics, which in the context of genetics are:
• Volume: GWAS requires the genotyping of thousands or millions of genotypes per participant, which for a complete GWAS produces a massive amount of data in total.
• Variety: big data systems may need to store and handle a mix of data types; GWAS usually stores data in several files, one for genotype data and one for individuals' information.
• Velocity: the rapid advancement of GWAS has been aided by the availability of genotyping technologies created expressly for assaying more than one million SNPs, such as sequencing the whole human genome in one day [15].
• Veracity: errors in the genotyping process can result in data quality concerns that are difficult to spot, which can have a significant impact on a study's biological results [16].
• Variability: a GWAS dataset can be stored in different formats; it is preferable to save the data in a binary formatted file, which yields a large decrease in file size and a significant increase in computing performance.
Figure 1 shows a graphical representation of big data in GWAS.

B. MACHINE LEARNING
Machine Learning (ML) simulates human learning by allowing computers to recognise and gain knowledge from the actual world, as well as enhance performance on particular tasks depending on this new information. ML was explored as a separate discipline in the 1990s [17], despite the fact that the earliest notions of ML (with different terminologies) were developed in the 1950s. Apart from computer science, ML algorithms are being applied in a variety of fields, including business [18], advertising [19], and medicine [20].
Learning is the process of gaining information; because of their ability to reason, humans naturally learn from their experiences. Conventional computers, on the other hand, do not learn by reasoning but by following algorithms. There are many machine learning algorithms in the literature, and they may be divided into groups based on how they approach the learning process; supervised, unsupervised, semi-supervised, and reinforcement learning are the four primary classes [21]. Figure 2 shows these ML types:
a) Supervised learning works with data that has been labelled; in the instance of GWAS, the SNP data is entered as inputs with corresponding labels, and the ML model automatically learns patterns and produces predictions for new, unseen inputs.
b) Unsupervised learning learns patterns from unlabelled data inputs; in genetics, unsupervised learning can be used to cluster genes that share a common characteristic.
c) Semi-supervised learning accepts both labelled and unlabelled data points.
d) Reinforcement learning feeds the model with unlabelled data; the model generates predictions, and feedback on whether each prediction was correct is provided to the model.
With the growth in processor speed and memory size, machine learning has become increasingly popular. As a result, the discipline currently contains a wide variety of algorithms that learn, draw conclusions, or infer facts through mathematical or statistical analysis [22]. The number of scholarly articles proposing modifications or combinations of machine learning algorithms continues to rise [23,24]. As a result, machine learning algorithms have been classified according to their intended use.

1) ARTIFICIAL NEURAL NETWORK
An artificial neural network (ANN) is a densely connected network of hundreds or even millions of fundamental processing nodes, loosely modelled after the human brain. The vast majority of today's artificial neural networks are "feedforward", meaning that information flows only one way through them, and they are organised into layers of nodes. However, there exist other types of ANN that accept feedback connections, mainly known as recurrent neural networks. These are characterised by their "memory", which allows knowledge from previous inputs to influence the current input and output. While typical deep neural networks assume that inputs and outputs are independent of one another, a recurrent neural network's output depends on the prior elements of the sequence, and future elements of a series may also be useful in establishing its outcome.
Nodes in a layer can be fully or partially connected to the nodes of the previous layer, from which they obtain data. Similarly, the nodes are connected to, and send data to, the nodes of the succeeding layer. Figure 3 illustrates an artificial neural network.
The process of training an artificial neural network starts by randomly setting the values of weights and thresholds. The input layer receives the training data, which is subsequently multiplied and combined in different complex ways until it reaches the output layer. The values of weights are continually adjusted throughout the training process.
Initially, the perceptron, a very basic artificial neural network, consisted of only two inputs and one output [25]. This setup enables the creation of a basic classifier that can discriminate between two groups. The ANN then evolved into the multilayer perceptron, which consists of three layers: input, hidden, and output. This development made it possible to solve more complex, non-linear problems [26].
Due to the increase in the volume of data and the complexity of the problems associated therewith, a new subset of machine learning algorithms was established known as deep learning. Deep learning (DL) excelled with its ability to automatically learn characteristics from data and the relationships between data points [27].
An artificial neural network architecture with numerous hidden layers and neurons is the basic architecture in deep learning. Various designs have been suggested, and many of them have found success in various applications including the analysis of genetic data. Convolutional neural networks are deep learning structures inspired by human visual cortex models that have been widely used in image recognition. Recurrent artificial neural networks, which imbue neurons with dynamic behavior, have emerged as the most popular approach for dealing with time series data and natural language processing [28].
Deep learning is a powerful tool for GWAS data analysis, mainly because the amount of data is enormous, far beyond our limited reasoning abilities. In genetic applications, a deep ANN can be built with nodes representing genetic elements (SNPs) and arcs indicating connections (interactions) between the elements.
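As a concrete illustration, the short sketch below (not taken from any reviewed paper; the dataset dimensions, layer sizes, and all data are illustrative assumptions) trains a small feedforward network on a synthetic genotype matrix in which each SNP is encoded as a minor-allele count (0, 1, or 2):

```python
# Minimal feedforward-ANN sketch on synthetic case/control genotypes.
# All data is randomly generated; no real GWAS dataset is used.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_snps = 500, 100                       # hypothetical dimensions
X = rng.integers(0, 3, size=(n_samples, n_snps))   # genotype matrix (0/1/2)
y = rng.integers(0, 2, size=n_samples)             # 1 = case, 0 = control

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Weights start random and are adjusted iteratively by backpropagation,
# as described above; two hidden layers give the network its depth.
model = MLPClassifier(hidden_layer_sizes=(64, 16), max_iter=500,
                      random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```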

2) SUPPORT VECTOR MACHINE (SVM)
SVMs are supervised learning algorithms that can be used to solve classification and regression problems [29]. In high-dimensional spaces, where the number of features exceeds the number of observations, SVMs are well known for their efficacy. The purpose of SVMs is to find a separating hyperplane with the maximum distance to each class. Different kernel functions are available for SVMs, and the selection of a kernel function depends on both the type of problem and the number of observations [30].
Data points are classified according to which side of the hyperplane they fall on. A hyperplane's dimension depends on the number of features: if there are only two input features, it is a line; if there are three input features, it becomes a plane. As the number of features increases, as is the case with genetic data, it becomes increasingly difficult to imagine the dimensions of the hyperplane. A support vector is a data point that is closer to the hyperplane and is more influential on its position and orientation. The main idea of the SVM in nonlinear classification problems is to transfer data into higher dimensions in order to find a suitable boundary that separates the classes. However, as the number of dimensions increases, computations inside that space become more expensive. The SVM uses a kernel trick to overcome this burden, allowing it to calculate high-dimensional relationships without transferring data into higher dimensions. In the case of GWAS, there are usually two classes, cases and controls, and the SNPs represent the features; given labelled data points (SNPs and the outcome), the SVM is capable of classifying cases and controls from the feature set. Figure 4 illustrates a support vector machine model of cases and controls.
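A minimal sketch of this case/control setup, assuming the same 0/1/2 genotype encoding and entirely synthetic data, is shown below; the RBF kernel lets the classifier evaluate high-dimensional relationships without explicitly mapping the SNP vectors into that space:

```python
# Minimal SVM sketch for case/control classification on synthetic SNPs.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(400, 200))   # hypothetical SNP features
y = rng.integers(0, 2, size=400)          # case/control labels

# Kernel choice is problem-dependent, as noted above; RBF is a common default.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```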

3) RANDOM FOREST (RF)
A random forest is a learning algorithm that builds a powerful overall classifier from an ensemble of decision trees. The trees in a random forest are usually trained using the bagging method, which rests on the idea that combining different learning models improves overall performance [31].
Because large feature subsets tend to increase computational complexity, a small subset size makes it easier to decide which features to split on; as a result, reducing the number of features used in training improves the algorithm's learning speed. Figure 5 illustrates the workflow of an RF model. First, the model randomly samples individuals from the original dataset to build new datasets; each of these newly created datasets contains the same number of features (SNPs) as the original one. These are referred to as bootstrapped datasets. Several trees are constructed, and each tree is trained using a random subset of features from the input feature set of the bootstrapped datasets. When an RF makes a prediction on a new data point, it passes the data point through each tree and records the predictions. The model then checks all predictions and outputs the majority vote as the final prediction. The process of combining results from multiple models is known as aggregation.
The bootstrapping process ensures that the model does not use the same data in every tree, which helps the model to be more robust, while the random feature selection helps reduce the correlation between the trees. RF can be mathematically summarised as follows [32]. Given a forest of $K$ trees $t_1, \dots, t_K$ trained on bootstrap samples, where $X$ is the data (independent) variable and $Y$ represents the dependent variable, the fraction of total votes for class $c$ is

$$p_c(X) = \frac{1}{K} \sum_{k=1}^{K} I\big(t_k(X) = c\big),$$

and the forest's prediction is the majority vote

$$\hat{Y} = \underset{c \in \{1, \dots, C\}}{\arg\max}\; p_c(X),$$

where $C$ stands for the total number of classes, $c$ stands for a particular class (in our case, case or control), and each tree $t_k$ contributes to the fraction of total votes for class $c$.
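The sketch below illustrates the bagging-and-voting scheme in the equations above on synthetic data (all dimensions and values are illustrative assumptions): each tree sees a bootstrap sample and a random feature subset per split, and the forest aggregates the trees' votes.

```python
# Minimal random forest sketch: bootstrapping, random feature subsets,
# and majority-vote aggregation on synthetic genotypes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(300, 50))    # hypothetical genotype matrix
y = rng.integers(0, 2, size=300)          # case/control

forest = RandomForestClassifier(
    n_estimators=100,        # K trees
    max_features="sqrt",     # random feature subset at each split
    bootstrap=True,          # each tree trains on a bootstrapped dataset
    oob_score=True,          # out-of-bag error estimate
    random_state=0,
).fit(X, y)

# predict_proba averages the trees' class estimates, which approximates
# the vote fraction p_c(X) from the equation above when leaves are pure.
print("class-vote fractions for first sample:", forest.predict_proba(X[:1]))
print("OOB accuracy:", forest.oob_score_)
```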

4) NAIVE BAYES
Naive Bayes is a basic learning technique that uses Bayes' rule in conjunction with the strong assumption that attributes are conditionally independent given the class [33]. Despite the fact that this independence assumption is frequently violated in practice, Naive Bayes classification accuracy is generally competitive. Because of this, as well as its computational efficiency, Naive Bayes is commonly used in practice. The posterior probability in Bayes' theorem is calculated as follows [33]:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

where $P(c \mid x)$ is the posterior probability of class $c$ given predictor $x$, $P(c)$ is the prior probability of the class, $P(x \mid c)$ is the probability of the predictor given the class, and $P(x)$ is the prior probability of the predictor.
In a genetic analysis scenario, let P(c|x) be the probability for a new data point X=<SNP1, SNP2…SNPn> to belong to class c (case or control).
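A minimal sketch of this computation on synthetic genotypes is given below; CategoricalNB treats each SNP (0/1/2) as a categorical feature assumed independent given the class, and returns the posterior P(c|x) from Bayes' rule above (all data is randomly generated):

```python
# Minimal Naive Bayes sketch: posterior P(case) and P(control) for a
# new synthetic genotype vector.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(300, 20))   # hypothetical genotypes
y = rng.integers(0, 2, size=300)         # case/control

nb = CategoricalNB().fit(X, y)
new_point = rng.integers(0, 3, size=(1, 20))   # X = <SNP1, ..., SNPn>
print("P(control | x), P(case | x):", nb.predict_proba(new_point))
```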

III. SEARCH STRATEGY
The systematic review relied on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) tool to review studies that have used machine learning algorithms in the analysis of genome-wide association study data. Scopus, PubMed, and Web of Science were searched for conference and journal articles matching terms such as "machine learning" and "genome wide association study". Full search queries are available in the supplementary material file. On 20 September 2021, articles were searched in their title, keywords, or abstract. The search was limited to the English language, and the date range was set between 1 January 2010 and 20 September 2021. The first author reviewed all the abstracts for inclusion. The full text was assessed when the abstract was relevant to the inclusion requirements.

IV. STUDY SELECTION
Through a two-stage screening procedure, we determined whether the studies retrieved by the search were eligible. We examined the paper titles and their abstracts first, then eliminated any papers that did not meet the inclusion criteria. Finally, the complete text of all studies that were considered relevant was evaluated using the same screening process. Studies were considered eligible if they met the following requirements:
• predicting Alzheimer's disease clinical outcomes or Alzheimer's disease alongside other diseases.
• only considering articles that use GWAS as dataset, excluding any articles based on text or imaging data for the detection of Alzheimer's disease.
• primary research rather than review papers.
• the entire manuscript is available, instead of simply an abstract or notes.
These requirements were chosen to ensure inclusion of all studies investigating Alzheimer's disease, even when another disease is being studied. The restriction of the genetic data to GWAS only is because we wanted to investigate the ability of ML models in this specific area of genetics, which has become popular during the past decade. After selecting the relevant papers, an analysis of each paper was conducted, considering the following questions and conditions:
1) Which ML models were employed?
2) What are the type and source of the data?
3) Are there any pre-processing steps conducted?
4) What is the overall performance of the model?
5) Which hyperparameter optimization methods were utilized?
6) What feature selection techniques are used?
7) Are there any reported genetic markers?
Papers published between January 2010 and September 2021 were considered. ML approaches had been applied in research prior to this period; however, the last decade has seen a surge in interest in machine learning in biological research, and as a result we found only a few research studies before that period.
Articles focusing on other forms of genetic data, such as gene expression or rare variants, were not included. Studies that integrated SNPs with additional types of biological markers, such as MRI and PET, were also excluded.

V. DATA EXTRACTION
A structured data collection form was created to help in the extraction of elements. The form includes extraction of general study characteristics such as author/s, type, study objective, and publication year as well as the characteristics of the population used in the study, such as source of data, feature size, and sample size. Finally, other ML models and data preprocessing methods details were also extracted (unbalanced outcomes, other data pre-processing steps, ML models, performance measurements, feature selection methods, and results).

VI. RESULTS
Following an initial search, we identified a total of 283 articles from the Scopus, Web of Science, and PubMed databases. This number was reduced after removing duplicates, resulting in 165 studies. These were then further reduced to 65 after evaluating whether both titles and abstracts met the inclusion criteria. The full texts were then subjected to a more in-depth review, and publications that failed to meet the inclusion requirements after thorough examination were removed. At this point, 24 articles remained to be included in the review. This number of articles is expected, given the relatively recent arrival of ML technologies in this specific area of human genetics as well as the difficulties in obtaining authorization for large GWAS datasets. Figure 6 shows a graphical representation of the selection process. As shown in Table 1, all of the included studies were conducted after 2010, with more than 50% of the included papers published after 2015, four studies published in 2020 [19][20][21][22] and three in 2021 [23][24][25].

A. AREA OF STUDY
Our review indicated that there are three main areas where researchers apply machine learning techniques in genome-wide association studies. In the first domain, a prediction model is developed to classify healthy and unhealthy individuals based on their genetic data. The second domain is developing intelligent models to identify new genetic markers associated with a particular chronic disease of interest. The third domain is developing models to discover epistatic interactions. The distribution of machine learning use across these areas in GWAS for Alzheimer's disease is illustrated in Fig 7.

B. DATA SOURCES
Large genome-wide association studies datasets were used for training the ML models in every paper. To test the association of each SNP with a phenotype, the genome data of each participant needs to be recorded.
The majority of the researchers (13 articles) used the genome-wide association study dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Four articles [34][35][36][37] used the AD GWAS dataset from the Translational Genomics Research Institute. One article [38] used a dataset from the National Alzheimer's Coordinating Center. One article [39] used a late-onset AD dataset provided by the Harvard Brain Tissue Resource Center and Merck Research Laboratories. Two articles [40,41] requested datasets from the

C. DATA PRE-PROCESSING
The vast majority of researchers applied some data pre-processing steps before passing the data to the model. It is well recognized that quality control (QC) in GWAS data is an essential pre-processing step for any analysis of genotype data, especially when studying phenotype associations, as it can have a strong influence on the end results [42]. QC and filtering procedures were performed on individuals and their SNPs in most of the papers using PLINK software. Only four articles [39,40,43,44] did not conduct any quality control procedures.
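As a hedged illustration of what such QC involves (a simplified analogue of typical PLINK filters, not the exact procedure of any reviewed paper), the sketch below drops SNPs with high missingness or low minor allele frequency from a synthetic 0/1/2 genotype matrix in which missing calls are encoded as -1:

```python
# Simplified QC sketch: per-SNP missingness and minor-allele-frequency
# filters on a synthetic genotype matrix (-1 marks a missing call).
import numpy as np

rng = np.random.default_rng(4)
G = rng.integers(-1, 3, size=(200, 1000))   # hypothetical genotypes

miss_rate = (G == -1).mean(axis=0)              # per-SNP missingness
called = (G >= 0).sum(axis=0)                   # non-missing calls per SNP
allele_counts = np.where(G >= 0, G, 0).sum(axis=0)
freq = allele_counts / (2 * called)             # allele frequency
maf = np.minimum(freq, 1 - freq)                # fold to the minor allele

keep = (miss_rate < 0.05) & (maf > 0.01)        # commonly used thresholds
G_qc = G[:, keep]
print(f"kept {keep.sum()} of {G.shape[1]} SNPs")
```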
A number of articles [43,[45][46][47][48] incorporated APOE (a gene whose polymorphic alleles are the most important genetic predictors of Alzheimer's disease risk [49]) genotyping into the dataset as a pre-processing step. The authors of [50] selected the top 2500 SNPs according to their p-value. A χ² statistical test was used in [51] to distinguish between the high- and low-informative subsets of SNPs. In [52], the authors retrieved only SNPs on the 19th chromosome to train their model. As GWAS data is extremely high-dimensional, all researchers tried to reduce the dimensionality by selecting a subset of SNPs. In most papers, authors used logistic regression with different p-value thresholds, mainly between $10^{-2}$ and $10^{-8}$, to find the most significant SNPs for a phenotype, and based on that a subset of SNPs was constructed for further analysis. Both [43,44] used the Boruta algorithm to find an appropriate set of SNPs for their model, whereas in [35] the ReliefF algorithm was utilized. In [53,54], the authors extracted a set of known genes related to AD based on meta-analyses of GWAS catalogued in the AlzGene database, while in [50] the authors sorted SNPs according to p-value using summary statistics from the International Genomics of Alzheimer's Project (IGAP). El Hamid et al. [46] selected SNPs from the top 10 AD candidate genes listed in the AlzGene database; a Chi-Squared Attribute Evaluator with a ranker search method was used to rank the SNPs and select the most important ones, further reducing the feature set size. In [55], the SNPs resulting from the intersection of three statistical approaches, the allelic test, the genomic test, and the regression test, were selected. A number of feature selection tests were used in [56], where the authors assessed the effectiveness of these tests through the classification task. Wang et al. [39] considered only the SNPs that reside in protein-coding exons according to GENCODE (a public research consortium) and then further excluded the SNPs on the X chromosome.
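The single-SNP filtering step described above can be sketched as follows (a hedged illustration on synthetic data; real analyses typically use PLINK or similar tools): fit a logistic regression per SNP and keep those whose coefficient p-value falls below the chosen threshold.

```python
# Per-SNP logistic regression filter: keep SNPs with p-value < threshold.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
G = rng.integers(0, 3, size=(300, 500))   # hypothetical post-QC genotypes
y = rng.integers(0, 2, size=300)          # case/control

p_values = []
for j in range(G.shape[1]):
    X = sm.add_constant(G[:, j].astype(float))   # intercept + one SNP
    fit = sm.Logit(y, X).fit(disp=0)
    p_values.append(fit.pvalues[1])              # p-value of the SNP term

p_values = np.array(p_values)
selected = np.where(p_values < 1e-2)[0]   # thresholds in the papers: 1e-2 to 1e-8
print(f"{selected.size} SNPs pass the threshold")
```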

D. TECHNICAL DETAILS
From a technical perspective, eight of the research papers used a single machine learning algorithm. In what follows, we provide details of the machine learning methods utilized, as well as model performance, sample size, and hyperparameter search.

1) MACHINE LEARNING METHODS
On genetic data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) phase 1 dataset, the Naive Bayes, tree-augmented Naive Bayes, and K2 learning algorithms were used for early detection of the illness. Based on the p-value criterion (p-value 0.05), the greatest classification accuracy was attained with 500 SNPs; the NB and K2 learning algorithms reached overall accuracies of 98% and 98.40%, respectively [45]. Based on genetic data from 188 controls and 176 AD patients, a deep learning model for AD prediction was constructed and evaluated; using convolutional neural networks and multilayer perceptrons, the model attained an area under the curve of 0.9 and 0.93, respectively [38]. The authors of [53] suggest using biologically motivated SNP selection as an input to RF for predicting patient risk of developing AD. Their findings reveal that non-disease-related SNPs perform similarly to or better than disease-related SNPs; as the identification of novel relevant markers is the most important effort in GWAS, these findings suggest that SNPs from unrelated sets might be new candidates for Alzheimer's disease. In a GWAS data collection of 550 controls and 861 cases, two distinct techniques were designed to find SNPs linked with AD. The first technique employed logistic regression to filter the data depending on a p-value threshold, resulting in a subset of SNPs that were then used by random forest to perform a multi-locus analysis; the second technique pre-selected loci for input into the RF classifier using biological information and logistic regression analysis. The first method yielded 199 SNPs; using 10-fold CV in RF modelling, these SNPs, together with other SNPs associated with AD, produced a predictive subgroup for AD prediction with an average error of 9.8%. With the second method, 19 variants were discovered; these variants were incorporated into a model that includes APOE and GAB2 SNPs to predict AD risk, and the model achieved a 10-fold CV average error of 17%. In [52], the authors first divided the genotyping data into 30 subgroups, each with 5000 SNPs, which resulted in a filtered collection of 2943 SNPs for TV-GroupSpAM. The technique was then repeated, but this time just with the SNPs that had been chosen in the initial round, resulting in a final collection of 126 SNPs. TV-GroupSpAM completed the analysis in 16 days, which was faster than the two control techniques [52].
RFs and NB were two prominent algorithms employed in the evaluated research. NB is known for its ease of construction; however, GWAS data contains strong interrelationships between SNPs, which runs counter to the naive assumption that all input features are independent [61]. RFs are a popular classifier because of their capacity to avoid overfitting [62]; however, because the classifier requires extensive hyperparameter searching, utilising RFs to predict disease risk may be difficult. Neural networks are strong prediction algorithms that can learn non-linear relationships in large and complicated datasets; in certain cases, an NN may infer data associations that are beyond the reach of other ML approaches. NNs are nonetheless notorious for being difficult to employ and tune, and prone to overfitting, especially when a dataset has many more features than observations [63]. SVMs are well-known for their simplicity and predictive accuracy and, as a result, are often used in prediction modelling [64].
Bayesian modelling approaches, meanwhile, have various characteristics that make them valuable in a wide range of genetic data analysis tasks. They enable the merging of data and domain expertise, and they allow for easier understanding of the causal links between variables. While Bayesian models are a valuable technique for describing expert knowledge, eliciting that knowledge from experts in a form that can be turned into probability distributions may be problematic [65].
The process of choosing a model architecture entails balancing model underfitting and overfitting, often known as the bias-variance trade-off [66]. Underfitting is prevalent when a low-capacity model is adopted relative to the problem complexity and dataset size; it can be mitigated by using a more parameter-rich model or by using less regularisation during training. More serious is model overfitting, which occurs when the evaluation overestimates the generalisation performance on previously unseen data. A surprisingly poor performance on the test set compared to the training set is an indicator of overfitting. Overfitting is often avoided by employing a validation set inside the training set for performance estimation, as well as by numerous regularisation techniques, for example early stopping [67].
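The sketch below shows one of these guards in practice (an illustrative setup on synthetic data, not a procedure from the reviewed papers): a validation split is carved out of the training set and training stops early once the validation score stops improving.

```python
# Early-stopping sketch: hold out a validation set inside the training
# data and stop when validation performance plateaus.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(6)
X = rng.integers(0, 3, size=(600, 100)).astype(float)
y = rng.integers(0, 2, size=600)

model = MLPClassifier(
    hidden_layer_sizes=(32,),
    early_stopping=True,        # use a validation set inside the train set
    validation_fraction=0.1,    # 10% of training data held out
    n_iter_no_change=10,        # stop after 10 epochs without improvement
    max_iter=1000,
    random_state=0,
).fit(X, y)
print("training stopped after", model.n_iter_, "iterations")
```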

2) MODEL PERFORMANCE
AUC was used in over half of the research papers (55% of models) as the performance measurement of the models. There was also information on a variety of classification metrics and model fit assessments, such as accuracy, sensitivity, and specificity. Out-of-bag error was used in a few papers (Supplementary Table 8).
Internal validation was reported for around 68% of models. Most of them used k-fold cross-validation, a resampling technique that involves testing a model on each of k distinct divisions of a dataset after training on the remaining k-1 folds. The most popular method was 10-fold cross-validation (CV). A random split between training and testing sets was employed in one study for internal validation [38]. Internal validation was not mentioned in five of the studies. Within studies, the performance of models varied depending on the machine learning algorithm used, the sample size, and the number of features. The AUC for Alzheimer's disease models ranged from 0.59 to 0.98, with the greatest AUC observed using a deep residual network [47]. On the ADNI phase 1 dataset, Naive Bayes produced a similarly high AUC (0.97) [43]. In [45] the authors reported an accuracy of 92.68% using 100 SNPs as the feature set size, which rose to 98.4% when the number of features was increased to 500 SNPs. Although there are few studies, deep learning appears to outperform other machine learning techniques [38], and the outcomes of utilizing an ensemble of machine learning algorithms appear to be superior to using a single method [50]. A variety of techniques have been applied to the genotype-based categorization of Alzheimer's disease patients and healthy controls using GWAS data, with different accuracies reported. However, because the predicted effect of genotype on sporadic AD prevalence is minimal, these extremely high classification accuracies might be the product of overfitting. Osipowicz et al. [44] found that if feature selection is undertaken before splitting the data into training and testing sets, the techniques are prone to overfitting; as a result, it is preferable to avoid selecting the features used to build the model based on data contained in the testing set. According to the authors, the anticipated classifier performance for currently available dataset sizes is between 0.55 and 0.7 (AUC), and greater accuracies reported in the literature are most likely the consequence of overfitting. Despite performing well in terms of prediction accuracy on some datasets, ML still struggles to select informative SNPs and build accurate prediction models.
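The leakage problem raised by Osipowicz et al. [44] can be avoided by performing feature selection inside each cross-validation fold, as in the hedged sketch below (synthetic data; the selector and classifier are illustrative choices): the pipeline refits the SNP selector on each training fold only, so the test fold never informs which SNPs are chosen.

```python
# Leakage-free evaluation sketch: feature selection nested inside CV.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.integers(0, 3, size=(400, 2000))   # many SNPs, few samples
y = rng.integers(0, 2, size=400)

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=100)),  # selection refit per fold
    ("clf", LogisticRegression(max_iter=1000)),
])
auc = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc")
# With pure-noise data this stays near 0.5, confirming no leakage.
print("10-fold CV AUC: %.2f +/- %.2f" % (auc.mean(), auc.std()))
```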

3) SAMPLE SIZE
In the study of brain illnesses, small sample numbers are a prevalent problem. The vast discrepancy between the number of features and the amount of data available in genetic studies adds another layer of complexity: a GWAS can include over a million SNPs, each of which can be considered a feature, while the average number of participants is significantly lower. The overall sample sizes were relatively small, with 11 of the research papers using datasets ranging in size from 364 to 485 samples. A dataset of over 1000 samples was utilised in nine studies, and datasets with sample sizes in the 700s were utilized in two. One study involved an imbalanced class distribution but did not report whether any steps were taken to re-balance the distribution prior to the analysis [54] (Supplementary Table 6).
The amount of data utilised in several of the listed studies was insufficient to properly explore the potential of machine learning technologies. Larger datasets, such as those used in meta-analyses, can aid in the accurate selection of candidate SNPs [68]. Research can be focused on developing machine learning models that work well for data where the number of features is much larger than the number of samples, which is a typical challenge in GWAS. Simulated datasets can be a good starting point for developing ML models that can then be tested on real datasets.

4) HYPERPARAMETER SEARCH
The majority of hyperparameter searches went unreported or were ambiguous, with a few models reported as having been used with default parameters. Reporting the type of search and tuning for a single model resulted in ambiguity, since it was unclear whether these settings applied to the other models in the research. As a result, it is possible that most studies considered many hyperparameter options but did not mention this (Supplementary Table 7).
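A reportable search can be as simple as an explicit grid with cross-validated scoring, as in the hedged sketch below (synthetic data; the grid values are illustrative), which makes both the searched space and the chosen values easy to state in a paper:

```python
# Hyperparameter grid search sketch for a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(8)
X = rng.integers(0, 3, size=(300, 200))
y = rng.integers(0, 2, size=300)

grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", 0.1],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      grid, cv=5, scoring="roc_auc").fit(X, y)
print("best params:", search.best_params_)   # report both of these
print("best CV AUC:", search.best_score_)
```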

E. REPORTED GENETIC MARKERS
Most of the studies (19 papers) reported SNPs associated with Alzheimer's disease according to their analysis. Five studies reported the SNP rs429358, found in the fourth exon of the APOE gene, as an Alzheimer's disease risk factor. Han [36] found interactions among the SNPs rs1931565 and rs4505578 with APOE. A list of the top SNPs reported by each article is given in Supplementary Table 10.

VII. DISCUSSION
The majority of studies only reported measurements of either discrimination or classification for each model. Only a few of the studies reported the measure of calibration used within the models [37]. Model building rests on many fundamental properties, including model calibration [69], which has been identified as a fundamental aspect lacking in the genetic prediction literature [70]. Model calibration compares the anticipated probability of the outcome happening with the observed probability. Discrimination measures should be presented alongside the classification measures of specificity, accuracy, and sensitivity, as together they display all the available information on the projected probabilities. Within genetics and machine learning, the most commonly used and available discrimination metric is the AUC. Hyperparameter optimization determines how machine learning models navigate the bias-variance trade-off and learn from the data [71]. It is surprising that this was either unreported or subjected to only a small number of manual tests. To ensure that models are neither underfit nor overfit, hyperparameters should be investigated in systematic ways. Hyperparameter selection often decreases overfitting; hence it is of utmost importance that systematic search is used in domains with many candidate predictors, such as genomics, and especially in those with a small number of samples relative to features.
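A minimal calibration check, sketched below on synthetic data (an illustrative setup, not drawn from the reviewed studies), bins predicted case probabilities and compares them with the observed case frequencies; a well-calibrated model tracks the diagonal:

```python
# Calibration sketch: predicted vs. observed probabilities in bins.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # signal in feature 0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
prob = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

observed, predicted = calibration_curve(y_te, prob, n_bins=5)
for p, o in zip(predicted, observed):
    print(f"predicted {p:.2f} -> observed {o:.2f}")
```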
None of the papers reviewed presented any decision-analytic methods, which are used to assess the clinical value of prediction models. The larger objective of these research works is to assist doctors in determining the correct prognosis for patients and to aid them in treatment planning and decision making. With this being said, no research has been seen to address the use of the models in real-world clinical trials and practice. A number of reasons can be given as to why machine learning can be difficult to use in healthcare professions and settings [72]. ML algorithms are often opaque in terms of how a prediction was formed and how various predictors contributed to the final decision. This may undermine the legitimacy and acceptability of the models' predictions in healthcare decision making. For model replication on other datasets, a degree of transparency is needed. It was brought to our attention that most studies' model reporting, along with their model development, was not transparent enough to permit use on other datasets. This implies that most models will have an inadequate amount of evidence to support their accuracy in various settings, as well as being impractical for use in real-world healthcare settings [73].
Reporting criteria and guidelines for incorporating machine learning algorithms have a chance of increasing ML acceptability. Future studies would profit from attempting to evaluate clinical utility along with potential effects [74]. To begin with, the adoption of reporting recommendations for the design and validation of clinical prediction models would be a sensible starting point. Recent research [75][76][77] has delved into the potential ethical issues which may come to light when using machine learning models. It is of utmost importance that algorithms and generated models be publicly accessible as well as comprehensive and transparent in their reporting. This would promote clinical utility along with independent external validation across multiple contexts.

A. THE CURSE OF DIMENSIONALITY
A well-known problem with genetic data is that the number of attributes is much larger than the number of samples, introducing a challenge not only for ML algorithms but also for DL and general statistical approaches [78]. To avoid the curse of dimensionality, feature selection and feature extraction are often used [79,80]. Some researchers [81][82][83][84] used multiple datasets to provide a larger number of samples to balance the number of features and samples. Other researchers used association analysis techniques to reduce the dimensionality of the original dataset to a significantly smaller number, mostly using logistic regression and selecting only SNPs that passed a p-value threshold [45].
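Feature extraction offers an alternative to SNP selection, as in the brief sketch below (synthetic data, illustrative component count): the genotype matrix is projected onto its top principal components before classification.

```python
# Feature-extraction sketch: PCA on a p >> n genotype matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(10)
G = rng.integers(0, 3, size=(200, 5000)).astype(float)   # p >> n

pca = PCA(n_components=50)            # 5000 SNPs -> 50 components
G_reduced = pca.fit_transform(G)
print(G_reduced.shape, "variance explained:",
      round(pca.explained_variance_ratio_.sum(), 2))
```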

B. MODEL INTERPRETATION
One of the main concerns when using machine learning approaches is the ambiguity in decision making. Owing to the architecture of machine learning algorithms, it is not yet clear how the learned patterns are formed, making it difficult to form a full picture of how the model reached a specific output from its inputs [85]. Researchers, especially those working on healthcare applications, prefer white-box methods in order to understand the decisions made, since these applications are very sensitive and any small error could be costly [86]. In [47], the last convolutional layer of the last res-block was made transparent to extract features used to investigate the interpretability of the DL model.
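One model-agnostic alternative to such architecture-specific transparency (a hedged sketch, not the method of [47]) is permutation importance, which scores a SNP by how much shuffling it degrades model performance:

```python
# Permutation-importance sketch on synthetic genotypes where SNP 0 is
# made genuinely predictive by construction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(11)
X = rng.integers(0, 3, size=(300, 30)).astype(float)
y = (X[:, 0] > 1).astype(int)        # label depends on SNP 0

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:5]
print("most influential SNP indices:", top)   # SNP 0 should rank first
```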

C. HYPERPARAMETERS TUNING
One of the challenging tasks in ML is the tuning of a model's hyperparameters. For instance, artificial neural networks require several hyperparameters, including the learning rate. A very small learning rate could take a long time to converge or get stuck in a local minimum; in contrast, a large learning rate will update the parameters faster but will most likely lead to oscillation. It is therefore one of the most important hyperparameters, and a careful decision should be made. Tuning the hyperparameters is a critical step in building an accurate ML model, and a few established practices exist for choosing the right set, such as grid search [83] for tuning the parameters of a deep ANN, whereas Bellot et al. [87] used a genetic algorithm for hyperparameter optimization. Many researchers randomly test a few combinations of hyperparameters and select the best values according to model performance.
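The random-testing practice mentioned above can be made systematic with a randomized search over the learning rate and other settings, as in the hedged sketch below (synthetic data; the candidate values are illustrative):

```python
# Randomized search sketch over ANN hyperparameters, including the
# learning rate discussed above.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(12)
X = rng.integers(0, 3, size=(400, 100)).astype(float)
y = rng.integers(0, 2, size=400)

param_dist = {
    # too small: slow convergence; too large: oscillation
    "learning_rate_init": [1e-4, 1e-3, 1e-2, 1e-1],
    "hidden_layer_sizes": [(32,), (64, 16)],
}
search = RandomizedSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_dist, n_iter=5, cv=3, random_state=0,
).fit(X, y)
print("best combination:", search.best_params_)
```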

D. IMBALANCE CLASSES
Another obstacle that limits the ability of machine learning models arises when the number of samples differs greatly between classes. Since the aim of ML for classification problems, such as classifying healthy vs. unhealthy individuals [82] or responders vs. non-responders to a disease treatment [88], is to obtain an efficient model for such discrimination tasks, a satisfactory number of samples per class should be provided.
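Two common re-balancing tactics are sketched below on synthetic imbalanced data (illustrative only): weighting errors on the rare class more heavily, and oversampling the minority class before fitting.

```python
# Class-imbalance sketch: class weighting and random oversampling.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(13)
X = rng.normal(size=(1000, 20))
y = (rng.random(1000) < 0.1).astype(int)      # ~10% cases, 90% controls

# Option 1: weight errors on the rare class more heavily.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: oversample the minority class to match the majority.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, n_samples=int((y == 0).sum()),
                      random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print("balanced class counts:", np.bincount(y_bal))
```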

VIII. RISK OF BIAS
Within each paper that aimed to develop a prediction model, the risk of bias was estimated for the optimal model. All of the models showed signs of bias, mainly in the area of analysis. Risk of bias (ROB) was assessed following the PROBAST guidelines [89]. Participants, predictors, outcome, and analysis are the four domains into which PROBAST is divided, with a total of 20 questions across the four domains to support structured ROB judgement. The answers to these questions are recorded in the supplementary ROB table. Participants' ROB was rated low in all the included studies. PROBAST for predictors is designed to help the researcher determine whether the processes for measuring biomarkers were the same for all study participants. Procedures for collecting predictors in references [35,37,40,41,56] were not provided; as a result, ROB for predictors was judged to be unclear. The procedures for collecting biomarkers are outlined in public materials provided by ADNI, and for all subjects, predictors generated from blood samples or MRI images were gathered using the same techniques; as a result, the process of acquiring predictors was assessed to have a low ROB for articles that used the ADNI dataset.
During analysis, the models showed a high ROB (as illustrated in Fig 8). Common causes included the limited amount of data used to create the model, incorrect or unjustifiable handling of missingness, the removal of enrolled individuals prior to analysis, predictor selection using univariable approaches, and failure to account for overfitting.

IX. CONCLUSION
In order to decide where to focus their research efforts, GWAS researchers need a comprehensive picture of the trends in ML algorithm utilization. This paper offers a rigorous examination of the machine learning techniques employed in the GWAS of Alzheimer's disease.
A total of 24 research publications were included following meticulous filtering based on the exclusion criteria. There has been, and continues to be, rising interest in utilizing machine learning to aid in predicting AD outcomes. Deep learning algorithms are becoming more popular in GWAS analysis, especially conventional artificial neural networks, while transfer learning approaches remain an open research topic. The trends and opportunities described are confirmed by the chronology of the number of primary studies published in recent years.
Despite the influx of research publications, a very limited number of papers met the standard criteria of clinical prediction tools, with none making their models available in a format which is either evaluable or usable. Regarding the improvement of current machine learning prediction algorithms and how they are built and verified, we can identify a few aspects worth pursuing: the use of larger-scale data sources that are richer and more diverse in the data they hold, improvements in model architecture, and thorough reporting of the development method and the final model. To enable such approaches and the evaluation of their use, improvements are badly needed in reporting and in machine learning research design. Within this framework, the use of guidelines, along with reporting requirements for implementing machine learning algorithms, could increase the value of investigations. Future work will investigate ML models that have been applied to genetic and image data for Alzheimer's disease.