A Novel Deep Neural Networks Model Based on Prime Numbers for Y DNA Haplogroup Prediction

Most of the Y chromosome (Ychr) region (approximately 95%) passes unchanged from father to son, except by the gradual accumulation of single-nucleotide polymorphism (SNP) mutations. This results in mutations being inherited together, where all males in the direct family will have an identical pattern of variations. These mutation patterns serve as markers and can be mapped into clusters known as Y DNA haplogroups. Besides lineage tracing, haplogroups have been associated with male infertility, semen parameters, and, more recently, disease progression in several populations. Thus, haplogroup prediction research is gaining importance because of the increasing interest in personalized medicine. Of note, there are two approaches to predicting haplogroups, where the difference lies in the genetic markers: short tandem repeats (STRs) or SNPs are inputs to the haplogroup prediction tools. STRs are not without limitations, as similar STR haplotypes exist between haplogroups, and this reduces the effectiveness of STR-based haplogroup prediction tools. By contrast, current SNP-based haplogroup prediction tools are computationally expensive. There have been no studies to date that leverage traditional machine learning and deep learning algorithms to identify mutation patterns using SNPs only, and this paper proposes a novel SNP-based deep neural networks (DNNs) model. However, DNNs suffer the curse of dimensionality and become computationally expensive with large datasets. Thus, this paper overcomes the limitation of the network by proposing a novel feature extraction method based on prime numbers that computes features in either the forward or reverse direction of the SNPs data. Our experimental results show that the model achieves a categorical cross-entropy loss value as low as 0.001 on the training dataset and as low as 0.039 on the test dataset.


I. INTRODUCTION
The Ychr is well-known for encoding sex-determining genes and a few other male-specific genes through translocation and transposition. Ychr is known to undergo genetic degradation: over the last 300 million years, the ancestral autosome that evolved to the human Ychr lost all (∼1500 genes) except for ∼78 of its genes. Thus, the Ychr was commonly regarded as a genetic wasteland. Jacobs et al. [10] first described a frequent loss of Ychr (LOY) in hematopoietic cells of aging men in 1963. For many years, researchers accepted the genetic wasteland view, and LOY was believed to be phenotypically neutral and an age-related phenomena [11], [12]. However, a recent study [13] seems to suggest the opposite; LOY may be involved with disease progression in various organs.
The associate editor coordinating the review of this manuscript and approving it for publication was Yudong Zhang . These studies [14]- [17] show that the frequency of the LOY in the cancer genomes ranges between 15-80% in different types of cancer disease.
An interesting attribute of Ychr is that the male-specific region is a nonrecombining region, and constitutes approximately 95% of the chromosome [18]. This region is passed down from father to son unchanged, except by the gradual accumulation of SNP mutations [19]. As a result, any mutations that occur on a Ychr will always be inherited together. These mutations trace the lineage on the Ychr, where all males in the direct family will have an identical pattern of variations. Thus, Ychr is used exclusively in surname testing and forensic identification of male offenders or victims via lineage tracing [20].
Correct interpretation of the mutation patterns can further improve our understanding of population and migration history. These mutations serve as markers and can be mapped into clusters known as Y DNA haplogroups (sometimes known as haplogroups or Y haplogroups) [19]. A haplogroup represents a group of people who have inherited common genetic markers from the same most recent common ancestor. Such information is useful not only to trace the paternal ancestry of an individual but also population events (e.g., migrations and bottlenecks) [21]- [23]. Moreover, various subgroups indicate different geographic signatures. For example, the following subgroups of haplogroup R represent particular geographic subregions: R2a for South Asia, R1b1c for Africa/Middle East, R1a1a1g for (Eastern) Europe and R1a1a1f for West Asia [24].
There are two approaches to predicting Y DNA haplogroups, where the difference lies in the genetic markers, STRs or SNPs, are inputs to the haplogroup prediction tools. Table 1 summarizes a few well-known STR-based and SNP-based prediction tools, including their advantages and limitations. STRs are not without limitations, as pointed out in more recent studies [2], [19], [36]- [38]. One major limitation is the existence of similar STR haplotypes between haplogroups [36], [39]. The study in [36] reported similar haplotypes between the haplogroups: B and I2, C1 and E1b1b1, C2 and E1b1a1, H1 and J, L and O3a2c1, O1a and N, O3a1c and O3a2b, and M1 and O3a2. As expected, such similarities reduce the accuracy of STR-based prediction tools. This conclusion is supported in separate studies [37], [38], [40], which suggest SNP analysis as a second validation step if accurate predictions are required. Thus, the most recent studies use a combination of STR and SNP as genetic markers [41], [42] for haplogroup predictions. The researchers in [41] used a phylogenetic tree of SNPs to represent haplogroups of samples from a known dataset. The tree was then used as a ground truth to facilitate variant findings in STR haplotypes.
However, with regard to phylogenetic trees, yHaplo, as indicated in Table 1, is the current state-of-the-art of the SNP-based haplogroup prediction tools but suffers from computational complexity. Moreover, the SNP alleles need to be on the correct strand as DNA is double-stranded. The definition of strand has been controversial [43]- [45]. The most intuitive definition of a strand uses the human genome reference as the forward strand; however, this has not been standard practice. Thus, the SNPs for Ychr need to be validated with the International Society of Genetic Genealogy (ISOGG) database [46] or other resources [47] to ensure correct reference and alternate allele labeling.
There is no doubt that the ISOGG database and the strand information are crucial for the ground truth labeling of haplogroups and subgroups of samples even in a machine learning model. However, upon labeling, a machine learning model can learn the distinctive mutation patterns of the training datasets in either direction (forward or reverse) of the sequence data and infer the new sample's haplogroup from the training set. This is possible because a machine learning model derives its power from its ability to differentiate patterns from the data itself. To the best of our knowledge, there have been no studies that leverage traditional machine learning and deep learning algorithms to identify those patterns using SNPs only, and this paper proposes to use deep learning. By contrast, the most recent work on haplogroup prediction uses a combination of STRs and SNPs with traditional machine-learning algorithms [42].
Deep learning is the emerging generation of artificial intelligence techniques and has grown immensely in applications in fields [48] ranging from computer vision to speech to signal processing to sequence and text prediction, and more recently, to bioinformatics and computational biology [49]- [54]. The basic models in deep learning are derived from artificial neural networks (ANNs), and deep neural networks (DNNs) is one such example. The contributions of this paper are summarized below: VOLUME 8, 2020 • Proposes a novel SNP-based DNNs model to learn the patterns between haplogroups and subgroups.
• Proposes a novel feature extraction method based on prime numbers to select SNPs as features. This is because DNNs suffer from the curse of dimensionality and become computationally expensive when used for a large number of SNPs and samples.
• Provides a comprehensive analysis of the experimental results. We show that patterns learned in either direction (forward or reverse) of the SNPs data can be used to infer the haplogroup of a new sample.
The rest of this paper is organized as follows. Section II provides an overview of Ychr. This overview can be skimmed by readers already familiar with this field but may serve as a useful tutorial for those new to the problem. Next, Section III describes our DNNs model for the problem domain, including the selection of hyperparameters and the intuition behind them. Section IV presents our methodology, describes the experiments, and discusses the obtained results. Finally, we draw some conclusions in Section V.

II. OVERVIEW OF Y CHROMOSOME
In this section, we provide a brief overview of the cytogenetic structure of Ychr, the definition of mutation and the Y DNA haplogroups.

A. CYTOGENIC STRUCTURE
Ychr is the smallest chromosome in humans (∼60 MB) [55] and is acrocentric. It has a short arm (Yp) and a long arm (Yq) that is separated by a centromeric region that is important for chromosome segregation during male meiosis [19]. There are three major regions in Ychr: pseudoautosomal, euchromatin and heterochromatin. Only pseudoautosomal regions, indicated by PARs in Figure 1, are involved in meiosis. In theory, a chromosome's whole-genome sequence data allow us to construct reliable phylogeny where the length of the branch is proportional to the number of SNPs, which is then mapped to build a maximum parsimony tree to infer the phylogeny of the sequences [19]. However, in practice, the structure of Ychr is very different than those of other chromosomes, where even with the most advanced technology only certain regions of the Ychr can be mapped unambiguously. These regions are scattered across the chromosome and add up to a length of 9.9 Mb. This ∼10 Mb is known as the callable region of Ychr [56]. Hereafter, we refer to SNPs in this region as callable SNPs.

B. MUTATION
Mutation (sometimes known as polymorphism) is the ultimate source of all genetic diversity and is any change in the DNA sequence. This can range from the substitution of a single base in the genome to small insertions and deletions of a few bases. A mutation only exists when at least two different alleles are present in a population, and both are present at ≥ 1% frequency [57].
An SNP is the most common genetic variation among individuals and involves a single base difference in a single DNA building block that is commonly known as a nucleotide. For example, an SNP may replace the nucleotide adenine (A) with the nucleotide guanine (G). On the other hand, STRs are repeated units of 1-7 base pairs in length, and those with a useful degree of polymorphism have a frequency of 10-30 [57]. The mutation rate refers to the frequency of mutations in a single gene, chromosome or even in an individual over time.

C. Y DNA HAPLOGROUPS
A haplotype is a combination of allele states of polymorphisms on the same chromosome, whereas a haplogroup is a group of similar haplotypes that share a common ancestor with an SNP mutation [57]. In theory, if an individual has a different SNP than another, they can be said to be in different haplogroups. However, in practice, this requires keeping track of millions of groups. Therefore, the Y Chromosome Consortium (YCC) [58] was formed in 2002 to collate all phylogenetically informative SNPs and assign universal nomenclatures to each recognized haplogroup. YCC defined a single capital letter to indicate major haplogroups, where letters A to T have been used. For the subgroups, two nomenclature systems were proposed: lineage-based, where names are alphanumeric (e.g., E1b1b1b1a); and mutationbased, where terminal SNP mutation is used to define them (e.g., E-M81). Both examples refer to the same subgroup of major haplogroup E. To date, 20 major haplogroups have been identified with numerous subgroups [59]. Figure 2 shows the phylogeny tree of the Ychr's SNP data of the 1000 Genomes Project [60], where the tips of the branches represent the major haplogroups.

III. DEEP NEURAL NETWORKS
This section describes a novel DNNs model and the hyperparameters used in this study. To ensure that both the network and hyperparameters are independent of the datasets used in the same problem domain since overfitting is a serious problem in neural networks, we used the preprocessed SNPs of 39 individuals of the CEU population [61] of the International HapMap Project [62] for the initial exploration of various deep learning models.
We selected the DNNs as the deep learning model of interest. Furthermore, recent research has shown than ANNs can handle small datasets [63], and this characteristic of the neural network fits our problem domain as it is challenging to obtain large datasets when compared to other big data domains. Upon selecting the DNNs as the model, we used the Myanmar dataset [61] (see Section IV for details) to search for the optimal hyperparameters.
Hyperparameter optimization is an area of research itself and several approaches ranging from grid search to genetic algorithms (GAs) [64] have been used. In this paper, we implemented a GA to search some of the optimized hyperparameters, while others are chosen intuitively by using a trial-and-error approach. This approach has been used in different problem domain studies [65]- [72]. However, in a new problem domain, it is unknown which hyperparameters are important and what values produce good performance. Thus, we also discuss the most important hyperparameters that we chose for our problem domain and the intuition behind these choices.

A. MODEL
The DNNs model of this paper is presented in Figure 3. The leftmost layer is known as the input layer, while the rightmost layer is known as the output layer. The input layer consists of input neurons that represent the number of SNP-based features. Meanwhile, the output neurons represent the number of unique classes based on either haplogroups or subgroups. By contrast, the hidden neurons in the two hidden layers form the network's bottleneck.

B. HIDDEN NEURONS
There is a trade-off between the number of hidden neurons with the training error rate, where using too few results in underfitting and too many results in overfitting [72]. There has been no census on the exact number; many studies have proposed several heuristics using a trial-and-error approach [69]- [71].
Sheela and Deepa [72] reviewed methods for fixing the number of hidden neurons in neural networks for the past 20 years and showed that their approach gave the lowest training error rate when measured using the Mean Squared Error (MSE) metric on weather datasets. We compared the network's MSE values using the number of hidden neurons obtained using the GA with their approach, and the comparison results are given in Table 2. The lower the MSE value, lower the error rate and the better the estimator. Moreover, to prevent the saturation of hidden neurons, we explored z-score and min-max normalizations, found the latter giving better results. This paper uses the min-max normalization. Similarly, Aksu et al. [73] reviewed the effect of various normalization methods on educational sciences datasets with ANNs and found the min-max normalization yielding the best results. VOLUME 8, 2020

C. ACTIVATION FUNCTION
The activation function provides nonlinear modeling capabilities for networks, and only by adding activation functions, DNNs possess hierarchical nonlinear mapping learning ability. Recently, Nwankpa et al. [68] reviewed the majority of activation functions in deep learning research, including their practical applications in deep learning models. The researchers proposed the use of Softmax activation on the output layer for the multiclass classification problem; this approach is also used in this paper.
By contrast, the activation functions for the two hidden layers are obtained using the GA, where the hyperbolic tangent (Tanh) is the activation function for hidden layer 1, and the exponential linear unit (ELU) is the activation function for hidden layer 2. The study in [74] showed that ''Tanh-Tanh'' combination of activation functions for both hidden and output neurons gave better training performance in multilayered perceptron architectures of neural networks [74]. By contrast, ELU has been used to speed up the training of deep neural networks [75] on computer vision datasets. Instead, this paper proposes ''Tanh-ELU-Softmax'' combination of activation function for both the hidden and output neurons.
To normalize the above activations in the intermediate layers of the network, a batch normalization layer was proposed by [76] and is used only with the first hidden layer, as indicated in Figure 3 as this network has only two hidden layers.

D. GRADIENT PROCESSING
A stochastic gradient with minibatches is used as the convergence depends on the updates and richness of the training distribution and not the size of the training dataset [65]. This characteristic fits our problem domain due to the small sample size issue described earlier. Table 3 shows hyperparameters tuned for gradient processing as well as the recommended values. Bengio [65] researched gradient-based training for deep architectures and recommended values for the initial learning rate and batch size. Moreover, the study in [67] showed that small batch sizes improve the generalization performance of DNNs. By contrast, Goodfellow, Bengio and Courville recommended momentum ranges in their deep learning book [66].  [65] and [66], and values used in the paper.
The epoch is a hyperparameter related to the batch size. The epoch's value was 177, which was obtained using the GA. There are no recommended values for this parameter except through trial-and-error.

E. DROPOUT
Dropout is a technique proposed by [77] to prevent the network from coadapting too much by literally dropping neurons. The recommended dropout rate for the input neurons is 20%, and the rate for hidden neurons is 50%. These values are also used in this paper.

F. GENETIC ALGORITHM
We now describe the GA that we implemented for hyperparameters tuning. GA begins by creating an initial population of DNNs with random values assigned for epoch, hidden neurons, and activation functions. Then, the algorithm selects the top two networks based on a fitness criterion defined by machine learning metrics, described in Section IV-B4, to become parents while discarding the remaining networks. The parent networks are used for breeding children through the cross over and mutation steps. The pseudocode GA of the algorithm is presented as Algorithm 1.

Algorithm 1 Genetic Algorithm for Finding Hyperparameters
1 pseudocode GA (Data, p, g); Input : Data, population size p and number of generations g. Output: Optimized hyperparameter values for epoch (e), hidden neurons (h 1 , h 2 ) and activation functions (a 1 , a 2 ). 2 Create an initial p-sized population of DNNs. 3 Evaluate the population using fitness criteria defined by the machine learning metrics. 4 Select the top two networks with the highest fitness scores to become parents, P. 5 while current population size < p do 6 Create child C via cross over of P. 7 Mutate C based on some randomness. 8 Add C to new population. 9 end 10 repeat steps 3 -9 until the gth generation.

Many variations of cross over and mutation steps exist.
Of note, we present the pseudocode CO of the cross over step, that we implemented as Algorithm 2. We chose ratio values of between 0 and 1, particularly values 0.75 and 0.25, as they yielded good results while experimenting with various ratios on the training dataset.
Similarly, we present the pseudocode Mut of the mutation step that we implemented as Algorithm 3, where a child is selected randomly. We chose a ratio of 0.50 as it worked sufficiently well on the training dataset.

IV. EXPERIMENTAL RESULTS
This section describes the steps taken in the experiments of this study as well as the justification given for each decision made for each method used, and then provides a comprehensive analysis of the results.  1 , a 2 ).

Algorithm 2 Cross Over Algorithm for Creating a Child
Apart from the full Ychr sequences, which is the highest level of phylogenetic resolution, there are Ychr SNP datasets with low and medium resolutions as a different set of SNPs might have been sequenced depending on the objective of their study. Moreover, unlike other big data problem domains, it is not easy to acquire large sample sizes to obtain a good representative of the haplogroups and their subgroups.
Thus, to represent our problem domain accurately, we used the Myanmar (Myan) [61] data consisting of 106 samples (i.e., individuals) as the training dataset. Phase 3 of the 1000 Genomes project (1000 Genomes) [60] data, consisting of 1,233 samples, was used as the test dataset. Both these datasets use the hg19 build. We have separated the training and test datasets so that the test set is invisible to the network's training to confirm the network's actual predictive power.

B. METHODOLOGY
In this section, we present our experimental framework of preprocessing, feature extraction methods, classification and evaluation metrics.

1) PREPROCESSING
For the Myan data, we used a Python script to preprocess the genotype data to a sequence of 0s and 1s as the files were in PLINK format. A ''0'' indicates that the SNP is similar to the reference allele, whereas a ''1'' indicates that a mutation has occurred at that position. However, such preprocessing was not required for the 1000 Genomes data, as it was already in a sequence of 0s and 1s. Thus, we used BCFtools to extract the required fields from the VCF file before using a Python script for further processing. Moreover, if a sample has a missing value at a particular position, the said position is removed from all of the samples. Therefore, 1000 Genomes data contains 58,732 SNPs, and Myan data contains 2,041 SNPs. Hereafter, we refer to the above preprocessed SNPs (in the form of 0s and 1s) as simply SNPs.
As we are using stratified 3-fold cross-validation, we ensure each haplogroup or subgroup has at least four samples. Of note, the stratified K-fold is a commonly accepted cross-validation technique. Stratification is a process that ensures each fold is a good representative of each class, which is dependent on the number of samples. This technique splits the dataset into groups known as folds and, in our case, three folds, where two folds are used for training the model while the remaining fold is used for testing the model. This resulted in 1,216 samples and 27 classes of haplogroups, and 985 samples and 76 classes of subgroups, for 1000 Genomes data. For the Myan data, there were 76 samples and 4 classes of subgroups. The samples were labeled based on the ground truth information of their datasets.

2) PRIME-NUMBER-BASED FEATURE EXTRACTION METHOD
DNNs suffer the curse of dimensionality. Thus, if we take each SNP as a feature, it becomes computationally expensive and encourages overfitting. This inspired us to use mutation information, i.e., the number of mutations in a particular range, as a feature instead. To do so, we divided the SNPs into fixed partitions and calculated the number of mutations, i.e., the mutation rate per partition. However, our exploration results show it is computationally expensive if we search for the optimal partition size, where we fall back to the same computational complexity problem that we are trying to overcome.
While exploring other approaches, we found that using number theory sequences, particularly prime number sequences as partition sizes, works sufficiently well. Thus, each prime number indicates the mutation rate per partition and is used as a feature. Mutation rates are computed in forward and reverse directions of the SNPs data, as shown in Figure 4. This results in two prime-number-based feature sets: forward and reverse. For example, the first feature in the reverse feature set has a mutation rate of ''2'' as there are two mutations in the first partition. On the other hand, the first feature in the forward feature set has a mutation rate of ''0'', as there are no mutations in the first partition.
The pseudocode CPF of the algorithm that computes the forward feature set is presented as Algorithm 4. CPF begins by making a left-to-right scan over the SNPs data to calculate the cumulative sum. A cumulative sum is a sequence of partial sums of a given sequence, and in our case, the sum of VOLUME 8, 2020 mutations of the SNPs data. This aids in efficient processing as positions closer to the current partition are used to obtain the mutation rate instead of having to recompute them every time we need to partition the data. A sequence of prime numbers forms the boundaries of the partitions, where each subsequent pair denotes the lower and upper boundary of the partition, respectively. By contrast, the reverse feature set is computed by modifying pseudocode CPF to make a right-toleft scan at step 2. Input : SNPs data and upper limit of prime numbers u. Output: Forward feature set of prime-number-based features, F. 2 Make a left-to-right scan over SNPs to compute the cumulative sum and store it in C. 3 Compute a list of prime numbers based on u and store it in P. 4 while not end of list P do 5 Scan C for the positions indicated by the first subsequent pair of prime numbers. 6 Compute the mutation rate by taking the difference of the number of mutations between those two positions. 7 Store the mutation rate in F. 8 Repeat steps 5-7 for the next subsequent pair of prime numbers. 9 end

3) CLASSIFICATION
For the first experiment, we used forward and reverse primenumber-based feature sets on the DNNs classifier described in Section III. The purpose of this experiment was twofold. First, it assessed if our model can accurately classify haplogroups and subgroups using prime-number-based features, and which feature set gave the best results. Second, it elucidated the complexity of the model with the new feature sets. Thus, for baseline comparisons, we used each SNP as a feature on the same DNNs classifier. Hereafter, we refer to this feature set as baseline.
To further gauge the performance of the prime-numberbased feature set, and whether our model can classify accurately using fewer SNPs than the callable SNPs, variations in the number of features were also assessed.

4) EVALUATION METRICS
We evaluated the described experiments using four machine learning evaluation metrics: categorical cross-entropy loss indicated as loss, accuracy, prediction and recall. All values were between 0 and 1, and the standard deviations are shown in brackets. Cross-entropy loss increases as the predicted labels continue to differ from the predicted labels. By contrast, accuracy is a metric tied to precision and recall. High precision and recall scores show that the classifier is giving accurate results (high precision), and the majority of the results are positive (recall).

C. RESULTS AND DISCUSSION
For the training dataset (consisting of 4 classes), both the forward and reverse feature sets (311 prime-numberbased features each) used about the same running time (∼6.4 seconds in contrast to the 2,041 baseline features, which used ∼8.6 seconds). However, the forward feature set achieved the lowest loss value (cf. 0.001 with 0.093 on the reverse feature set and 0.007 on the baseline feature set). Table 4 summarizes the results on the test dataset consisting of 27 classes when all of the 5,943 prime-number-based features are used in contrast to the 58,732 baseline features. Once again, both the forward and reverse feature sets use about the same running time (∼ 200 seconds in contrast to the baseline, which used ∼ 32 minutes). The forward feature set achieved the lowest loss value (cf. 0.039 with 0.072 on the reverse feature set and 0.056 on the baseline feature set). On the other hand, Tables 5 and 6 show the variation in results for the forward and reverse feature sets. We can conclude that the reverse feature set gave better prediction results. This is because when 512 features were used in both feature sets, the reverse feature set gave the lowest loss value (cf. 0.038 with 0.092 on the forward feature set). Of note, we ran the model on the same settings for the subgroups     consisting of 76 classes but obtained high cross-entropy loss due to the small sample size in some of the subgroups. The results are presented in Tables 7, 8 and 9 in Appendix.

V. CONCLUDING REMARKS
We conclude from the above results that our novel DNNs model can be used to predict haplogroups accurately, and that our novel feature extraction method reduced the model's running time without degrading the prediction performance. Second, the prime-number-based feature sets can be used to achieve practical performance results where the results are similar. This is because the forward feature set gave better results for the training dataset, and the reverse feature set gave better results for the test dataset. We believe this can be attributed to the position of SNPs that indicate whether a mutation has occurred. As a result, if the same positions were chosen, a mutation pattern might not be seen as the SNPs are currently being represented as a sequence of 0s and 1s. Third, we showed that the accuracy improves as more features are used. This further indicates that fewer SNPs than the callable SNPs described in [56] may be used to differentiate haplogroups.
In this paper, we have 1) proposed a novel SNP-based DNNs model that learns patterns between haplogroups and subgroups; 2) proposed a novel feature extraction method based on prime numbers that reduces the computational complexity of DNNs; and 3) provided a comprehensive analysis of the experimental results that show patterns learned in either direction (forward or reverse) of the SNPs data can be used for haplogroup and sub-groups predictions. Table 7-9.