N-GlycoGo: Predicting Protein N-Glycosylation Sites on Imbalanced Data Sets by Using Heterogeneous and Comprehensive Strategy

Glycosylation is the most complex post-translational modification of proteins. It participates in many biological processes in the human body and is closely related to many disease states. Among the glycosylation types, N-linked glycosylation has the most data available. However, current N-linked glycosylation prediction tools do not take into account the serious imbalance between positive and negative data. In this study, we used protein sequence and amino acid characteristics to construct an N-linked glycosylation prediction model called N-GlycoGo. Based on sequence, structure, and function, 11 heterogeneous features were encoded, and XGBoost was selected for modeling. In independent testing of the human and mouse prediction models, N-GlycoGo outperformed most other glycosylation site prediction tools, with Matthews correlation coefficient (MCC) values of 0.397 and 0.719, respectively. N-GlycoGo is a fast and accurate prediction tool for N-linked glycosylation and is available at http://ncblab.nchu.edu.tw/n-glycogo/.


I. INTRODUCTION
Glycosylation is the most complex and common post-translational modification and involves the enzymatic attachment of sugars to proteins. Glycosylation affects many important biological processes, such as protein folding, cell-to-cell information transmission, gene expression, and control of cellular metabolism. Four main types of glycosylation are known: N-linked, O-linked, C-linked, and GPI anchors. N-linked glycosylation, the most common, involves the attachment of carbohydrates to the amide group (NH2) of asparagine at the conserved motifs N-X-S and N-X-T, where X can be any amino acid except proline [1], [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Quan Zou.
To control and predict glycosylation, various methods have been used, including genetic or cell culture modification [3] and dynamics [4], genetic engineering [5], and genome models [6]. Constructing these models requires computing tools and biological experiments, and parameter tuning, training, and repeated experiments take considerable time, especially for mechanistic kinetic models [7]. Although these technologies have high accuracy, the instruments are expensive. Moreover, generating the large amount of data consumes considerable experimental material and labor. Therefore, using machine learning methods to develop tools that predict glycosylation sites within a few hours is essential. Several prediction tools use amino acid sequences to predict post-translational modification sites.
Publicly available glycosylation prediction tools include NetNGlyc [8], GPP [9], GlycoPP [10], GlycoEP [11], SPRINT-Gly [12], and N-GlyDE [13]. NetNGlyc 1.0 uses artificial neural networks (ANNs) to predict the N-glycosylation sites on human proteins. GPP encodes the secondary structure (SS) and surface accessibility (ASA) [14] of mammalian protein sequences and then uses random forest (RF) for prediction. GlycoPP encodes binary profile of patterns (BPP), composition profile of patterns (CPP), and PSSM profile of patterns (PPP) for prokaryotic protein sequences and then uses a support vector machine (SVM) for prediction. GlycoEP performs BPP, CPP, PPP, SS, and ASA coding for eukaryotic protein sequences and then uses SVM for prediction. SPRINT-Gly uses deep neural networks (DNNs) to predict N-linked and O-linked glycosylation sites on human and mouse protein sequences. N-GlyDE uses SVM to generate a two-stage prediction model for human glycosylation. Although the current prediction methods are accurate, some problems remain: the datasets used to train the models are relatively small, the amino acid information used for feature encoding is incomplete, and feature selection is not used to remove unimportant features. The classifiers are also older methods, and several newer classification algorithms are available that can greatly improve accuracy. Therefore, we constructed N-GlycoGo to improve the prediction of glycosylation sites through integrated models [15], to use all of the imbalanced positive and negative data while addressing the imbalance, and to develop a more accurate model for predicting glycosylation. In addition to the sequence- and structure-based features used for feature encoding, the subcellular location of a protein contains important information about protein function and is closely related to the signal peptide [16]. Therefore, SignalP-5.0 [17] has been added as a function-based feature.
N-GlycoGo uses a total of five coding tools to generate 11 features. XGBoost [18] was used to build a prediction model for N-linked glycosylation sites. In the independent tests of humans and mice, the highest MCC was 0.957 and 0.738, respectively. The performance of the other tools shows that the early tools have lower MCC and cannot predict glycosylation sites across species.

II. MATERIALS AND METHODS
N-GlycoGo uses an ensemble model [19] to make predictions from heterogeneous features. The flowchart for constructing the N-linked glycosylation prediction tools for humans and mice is shown in Figure 1.

A. DATA COLLECTION
The glycosylation data sources used by N-GlycoGo include Universal Protein Resource (UniProt), dbPTM, and O-GlycBase v6.00.

1) UNIPROT
UniProt [20] is a database of protein sequence and annotation data jointly developed by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). It includes UniProtKB [21], UniRef [22], and UniParc [23].

2) DBPTM
dbPTM [24] is a post-translational modification database that integrates experimentally verified data from multiple databases.

B. DATA PREPARATION

1) HUMAN TRAINING SET
The training set used by N-GlycoGo was obtained from UniProt. The post-translational modification annotations were searched with the keyword glycosylation, and entries labelled CARBOHYD and verified by experiments were collected (excluding annotation lines labeled probable, potential, and similar). The experimentally verified glycosylated and non-glycosylated N-linked sites were considered positive and negative sites, respectively. The sequences were fragmented with a window size of 21, with the glycosylation site at the center and 10 amino acids on each side. For each blank position, a virtual amino acid "-" was added. Thereafter, CD-HIT was used to remove sequences that were more than 30% similar, to avoid over-estimating machine learning performance. A total of 3836 positive and 18277 negative sites were obtained for humans.
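The fragmentation step can be illustrated with a minimal sketch. It assumes that every asparagine (N) in a sequence is a candidate site and that positions are reported 1-based; the function name `extract_windows` and its parameters are hypothetical and not taken from the N-GlycoGo code.

```python
def extract_windows(sequence, flank=10, pad="-"):
    """Return (position, window) pairs for every asparagine in `sequence`.

    Each window has length 2 * flank + 1 with the asparagine at its center.
    Positions beyond either end of the sequence are filled with `pad`.
    Assumption: every N is treated as a candidate site.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "N":
            # Residue i sits at index i + flank in `padded`, so this slice
            # is centered on it.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


if __name__ == "__main__":
    for position, window in extract_windows("MNGTAAN"):
        print(position, window)
```

Labelling each window as positive or negative would then depend on whether its position carries an experimentally verified annotation.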

2) MOUSE TRAINING SET
The same protocol used for the human training set was used to obtain mouse protein data from UniProt. A total of 57 positive and 948 negative sites were obtained for mice.

3) HUMAN INDEPENDENT SET
Glycosylation data from different sources were used to evaluate the stability of the model. For humans, data were collected from dbPTM and O-GlycBase. After removing the proteins that appeared in the human training set, CD-HIT was used to remove sequences with more than 30% similarity, and a total of 57 glycosylation sites remained. Thereafter, positive and negative sites were extracted to yield 57 positive and 948 negative sites.

4) MOUSE INDEPENDENT SET
Glycosylation sites of mouse protein data from dbPTM and O-GlycBase were selected for evaluation. After removing the proteins that appeared in the mouse training set, CD-HIT was used to remove sequences with more than 30% similarity. Finally, 13 glycosylation sites, including 13 positive sites and 145 negative sites, were selected.

C. PREDICTIVE MODEL
In both the training and independent testing data, the numbers of positives and negatives differ greatly. Therefore, ensemble learning is used to construct the model and address the problem of imbalanced data [19]. N-GlycoGo draws samples from the negatives so that each model is trained on similar numbers of negatives and positives; these models are then integrated to improve the overall performance. The algorithms evaluated are Random Forest, SVM, and XGBoost.

D. FEATURE ENCODING
N-GlycoGo uses five coding tools to generate 11 features, which are divided into three categories: sequence-, structure-, and function-based features.

1) SEQUENCE-BASED FEATURES
iLearn [26] can encode DNA, RNA, and protein sequences. We used iLearn's binary, AAindex [27], amino acid composition (AAC) [28], and composition of k-spaced amino acid pairs (CKSAAP) [29] encodings. Binary encodes amino acids in a binary manner: each of the 20 amino acids is converted into a 20-dimensional vector of 0s and 1s, giving 20 distinct codes. With a window size of 21, this yields a feature of 420 bits. It is the most primitive and direct expression of the composition and distribution of the linear amino acid sequence. AAindex is a database of the physical and biochemical properties of amino acids. It is divided into three sections: AAindex1, AAindex2, and AAindex3. N-GlycoGo uses only AAindex1 because glycosylation is related to peptide binding and has nothing to do with amino acid mutations (AAindex2), and because these peptides are linear and do not form secondary structures (AAindex3). The 531 physical, chemical, and biochemical properties are coded as features. The AAC encoding calculates the frequency of each amino acid type in a protein or peptide sequence. The CKSAAP encoding calculates the frequency of amino acid pairs separated by k residues (k = 0, 1, 2, ..., 5; the default maximum value of k is 5).
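As an illustration of the binary, AAC, and CKSAAP encodings described above, the following is a simplified sketch written from those descriptions; it is not iLearn's implementation. The function names are hypothetical, the gap character "-" is assumed to encode as all zeros, and CKSAAP frequencies are assumed to be normalized by the number of gap-free pairs at each k.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def binary_encode(window):
    """One-hot encode each residue; a gap '-' becomes 20 zeros (assumption)."""
    bits = []
    for residue in window:
        bits.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return bits


def aac_encode(window):
    """Frequency of each amino acid type, ignoring gap characters."""
    residues = [r for r in window if r in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero on an all-gap window
    return [residues.count(aa) / total for aa in AMINO_ACIDS]


def cksaap_encode(window, max_k=5):
    """Frequency of residue pairs separated by k residues, for k = 0..max_k."""
    features = []
    for k in range(max_k + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        for i in range(len(window) - k - 1):
            pair = window[i] + window[i + k + 1]
            if pair in counts:  # skips pairs that contain a gap
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features


if __name__ == "__main__":
    window = "-" * 9 + "MNGT" + "-" * 8  # a 21-residue window
    print(len(binary_encode(window)))    # 21 positions x 20 bits
    print(len(aac_encode(window)))
    print(len(cksaap_encode(window)))    # 6 values of k x 400 pairs
```

For a window of 21 residues the binary encoding has 21 x 20 = 420 entries, matching the 420-bit feature described in the text.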
The Pse-in-One [30] tool was developed by the Harbin Institute of Technology and can generate pseudo components of DNA, RNA, and protein sequences. We used three protein modules of this tool: Kmer, parallel correlation pseudo amino acid composition (PC-PseAAC), and series correlation pseudo amino acid composition (SC-PseAAC). Their outputs were evaluated for both the complete protein sequence and the window of size 21 to increase feature information. The value of Kmer represents the occurrence frequencies of k adjacent amino acids. PC-PseAAC combines continuous local sequence-order information and global sequence-order information into protein sequence feature vectors. SC-PseAAC is a variant of PC-PseAAC that also combines local and global sequence-order information into a protein sequence feature vector.
WebLogo 3 [31] displays multiple aligned amino acid or nucleic acid sequences. Each position in the alignment is shown as a stack of one-letter abbreviations of the amino acids or nucleic acids, and the height of each stacked letter represents the relative frequency of that amino acid or nucleic acid at that position. Previous studies on N-linked glycosylation have reported that the glycosylation site motif N-X-S or N-X-T is conserved, where X can be any amino acid except proline. We used WebLogo 3 to evaluate the frequency value of each amino acid at each position in the positive segments. For a gap, the value was 0.
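The motif feature can be sketched as a per-position frequency lookup. The sketch below assumes plain relative frequencies computed over the positive windows; it does not call WebLogo 3 and does not reproduce its information-content scaling. The names `position_frequencies` and `motif_encode` are hypothetical.

```python
from collections import Counter


def position_frequencies(positive_windows):
    """Relative frequency of each amino acid at each window position.

    Gap characters are left out of the table, so looking them up later
    yields 0, as described in the text.
    """
    n = len(positive_windows)
    table = []
    for pos in range(len(positive_windows[0])):
        column = Counter(w[pos] for w in positive_windows)
        table.append({aa: c / n for aa, c in column.items() if aa != "-"})
    return table


def motif_encode(window, table):
    """Replace each residue by its frequency at that position (0 if unseen)."""
    return [table[pos].get(residue, 0.0) for pos, residue in enumerate(window)]


if __name__ == "__main__":
    # Toy 3-residue windows used only to keep the example short.
    table = position_frequencies(["NGT", "NAS", "-GT"])
    print(motif_encode("NGS", table))
```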

2) STRUCTURE-BASED FEATURES
NetSurfP-2.0 [32] uses deep learning to predict structural characteristics of protein or amino acid sequences, including the surface accessibility of exposed and buried amino acids, the probabilities of α-helix, β-strand, and random coil, the structural disorder of proteins [33], and the phi/psi dihedral angles [34] of amino acids.
Protein surface accessibility (relative/absolute surface accessibility, RSA/ASA) includes the evaluation of buried or exposed residues and the ASA Z-score (the Z-score is a prediction of surface area and does not contain structural information). Buried and exposed residues are scored as 10 and 01, respectively, and the Z-scores of RSA/ASA and ASA are added to the score. The secondary structure provides probability scores for α-helix, β-strand, and random coil, and the three values are used for scoring.

3) FUNCTIONAL-BASED FEATURES
SignalP 5.0 [17] uses a deep neural network to predict signal peptide (SP) [35] cleavage sites from the amino acid sequences of archaeal, gram-positive bacterial, gram-negative bacterial, and eukaryotic proteins. The subcellular localization of a protein depends on its signal peptide [16]. SignalP 5.0 is used to predict the signal peptide cleavage site on the sequence. The prediction result includes the C-score (the raw cleavage site score), the S-score (the signal peptide score), and the Y-score (the combined cleavage site score). All three values are used.

E. FEATURE SELECTION
N-GlycoGo uses mRMR (minimum redundancy maximum relevance) for feature selection. mRMR is a feature filter in which "relevance" and "redundancy" are defined using mutual information, correlation, t-test/F-test, distance, etc. Three feature rankings are output: max-relevance, and mRMR calculated using the two schemes of mutual information difference (MID) and mutual information quotient (MIQ).
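The difference between the two mRMR schemes can be shown with a small greedy selection sketch. It assumes discrete feature values and mutual information as the measure of both relevance and redundancy; it is not the mRMR program used in this study, and the function names and the `scheme` parameter are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr(features, labels, n_select, scheme="MID"):
    """Greedy mRMR ranking.

    features: dict mapping a feature name to its list of discrete values.
    scheme:   "MID" scores relevance - redundancy,
              "MIQ" scores relevance / redundancy.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    # Start from the single most relevant feature (the max-relevance choice).
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(n_select, len(features)):
        best_name, best_score = None, float("-inf")
        for name, col in features.items():
            if name in selected:
                continue
            # Redundancy: mean mutual information with already chosen features.
            redundancy = sum(mutual_information(col, features[s])
                             for s in selected) / len(selected)
            if scheme == "MID":
                score = relevance[name] - redundancy
            else:
                score = relevance[name] / (redundancy + 1e-12)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    features = {
        "a": [0, 0, 0, 1, 1, 1, 1, 1],
        "b": [0, 0, 0, 1, 1, 1, 1, 1],  # exact copy of "a", fully redundant
        "c": [0, 0, 1, 0, 1, 1, 0, 1],
    }
    print(mrmr(features, labels, 2, scheme="MID"))
    print(mrmr(features, labels, 2, scheme="MIQ"))
```

In this toy data, feature "b" is as relevant as "a" but adds nothing new, so both schemes prefer the less relevant but less redundant feature "c" as the second choice.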

F. ALGORITHMIC ENSEMBLE TECHNIQUES
The simple integration method repeatedly draws samples from the majority class so that the numbers of majority and minority samples are the same for each model, and finally integrates these models. The main purpose of the ensemble method is to improve on the performance of a single classifier. This method constructs several two-level classifiers from the original data and then assembles the predicted results.
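The under-sampling ensemble can be sketched as follows. The sketch assumes that the negatives are the majority class, that each model sees all positives plus an equally sized random sample of negatives, and that the models are combined by averaging their scores. The base learner `fit_centroid` is a toy stand-in chosen to keep the example dependency-free; it is not XGBoost, and all function names are hypothetical.

```python
import random


def train_ensemble(positives, negatives, fit, n_models=5, seed=0):
    """Train one model per balanced subset of the data.

    `fit(samples, labels)` must return a callable mapping a sample to a score
    in [0, 1]. Assumes there are at least as many negatives as positives.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sampled = rng.sample(negatives, len(positives))  # under-sample majority
        samples = positives + sampled
        labels = [1] * len(positives) + [0] * len(sampled)
        models.append(fit(samples, labels))
    return models


def predict_ensemble(models, x, threshold=0.5):
    """Average the member scores and apply a decision threshold."""
    score = sum(model(x) for model in models) / len(models)
    return (1 if score >= threshold else 0), score


def fit_centroid(samples, labels):
    """Toy base learner: vote for the class whose centroid is nearer."""
    def centroid(cls):
        rows = [s for s, label in zip(samples, labels) if label == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos_c, neg_c = centroid(1), centroid(0)

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos_c))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg_c))
        return 1.0 if d_pos < d_neg else 0.0

    return score


if __name__ == "__main__":
    positives = [[1.0, 1.0], [0.9, 1.1]]
    negatives = [[0.0, 0.0], [0.1, -0.1], [-0.2, 0.1], [0.0, 0.2], [0.1, 0.1]]
    models = train_ensemble(positives, negatives, fit_centroid, n_models=3)
    print(predict_ensemble(models, [1.0, 0.9]))
    print(predict_ensemble(models, [0.0, 0.0]))
```

Passing the base learner as a function keeps the sampling logic separate from the classifier, so Random Forest, SVM, or XGBoost could each be compared under the same sampling scheme.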

G. MODEL EVALUATION
Judging the quality of a model requires certain criteria; therefore, the choice of evaluation indicators is also important. Accuracy (ACC), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (MCC) are common indicators used to evaluate machine learning models. ACC is the most intuitive evaluation indicator, as shown in equation (1), where TP, FP, FN, and TN are true positives, false positives, false negatives, and true negatives, respectively. Sn represents the proportion of all positives that are correctly predicted, as shown in equation (2), and reflects the model's ability to predict positives. Sp represents the proportion of all negatives that are correctly predicted, as shown in equation (3), and reflects the model's ability to predict negatives. MCC is a suitable evaluation index when the ratio of positives to negatives is uneven. As shown in equation (4), MCC is close to 0 when the predictions are no better than random, equal to 1 when all predictions are correct, and equal to −1 when all predicted results are the opposite of the actual values.
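The four indicators can be computed directly from the confusion-matrix counts. The sketch below uses the standard textbook definitions, which are assumed here to correspond to equations (1) to (4); the function name `evaluate` and the convention of returning 0 for an undefined ratio are choices made for this example.

```python
from math import sqrt


def evaluate(tp, fp, fn, tn):
    """Return ACC, Sn, Sp, and MCC from confusion-matrix counts.

    ACC = (TP + TN) / (TP + FP + FN + TN)
    Sn  = TP / (TP + FN)
    Sp  = TN / (TN + FP)
    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    acc = (tp + tn) / (tp + fp + fn + tn)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


if __name__ == "__main__":
    # A model that labels every site as negative on a 10:90 data set.
    print(evaluate(tp=0, fp=0, fn=10, tn=90))
```

The example shows why MCC is preferred on imbalanced data: predicting every site as negative still gives an ACC of 0.9, while Sn and MCC are both 0.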

III. RESULTS

A. COMPARISON OF ALGORITHMS
To solve the problem of data imbalance, we constructed the model using ensemble learning. The prediction methods were evaluated by ten-fold cross-validation, as shown in Table 1.
For the data deduplicated with CD-HIT, XGBoost reaches an MCC of 0.981, which is higher than the 0.96 of the traditional SVM and the 0.961 of Random Forest.

B. FEATURE ANALYSIS
To explore the importance of each feature in predicting N-linked glycosylation sites in humans, N-GlycoGo uses mRMR to select the top 10 and top 100 of the 14383 features for modeling, prediction, and accuracy calculation. As seen in Table 2, with a small number of features the ensemble learning model slightly improves MCC, except when the features are selected by the MIQ scheme, in which case MCC decreases. When the number of features reaches 100, however, the selected features are also chosen by the other schemes, so the ensemble learning models give the same, stable predictions; therefore, second-stage prediction cannot further improve accuracy.
Table 2 also shows that selecting more features does not necessarily increase accuracy and slows down the running speed. Choosing the right number of features can increase both the speed and the accuracy of prediction.

C. PERFORMANCE OF INDEPENDENT TEST
We have compiled the existing prediction models for glycosylation sites in Table 3. The table contains the modeling methods, type of glycosylation, and species for each model. NetNGlyc [8], GPP [9], GlycoPP [10], GlycoEP [11], SPRINT-Gly [12], and N-GlyDE [13] were selected for this study.
NetNGlyc 1.0 uses artificial neural networks (ANNs) to predict N-linked glycosylation sites on human protein sequences. However, it can only predict sequences shorter than 2000 residues. GPP uses SS and ASA to score mammalian protein sequences and uses random forest for prediction. GlycoPP uses BPP, CPP, PPP, and ASA + BPP to predict glycosylation sites in prokaryotes through SVM. GlycoEP uses features such as BPP, CPP, PPP, and ASA + BPP to predict through SVM, and provides four features for users to choose from. According to the training set, it is divided into two prediction tools: Standard Predictor (S) and Advanced Predictor (A). SPRINT-Gly uses a deep neural network (DNN) to predict N-linked and O-linked glycosylation sites on human and mouse protein sequences. N-GlyDE uses SVM to build a two-stage sequence prediction model. The first stage provides a prediction score for each protein, and the second-stage glycosylation prediction score can be adjusted according to that score.
To evaluate the predictive performance and stability of N-GlycoGo, the protein sequences in the independent set were used for prediction; Sn, Sp, ACC, and MCC were calculated from the prediction results and compared with those of the existing glycosylation site prediction tools. The performance of N-GlycoGo on the human independent set is shown in Table 4. Its MCC value is 0.397, which is tied for first place with GlycoEP_A_BPP. The performance on the mouse independent set is shown in Table 5. GlycoEP's BPP has the highest MCC value of 0.766, followed by N-GlycoGo's 0.719. However, the performance of GlycoEP varies widely across its 6 prediction models, and its average MCC is only 0.382, so it may be difficult for users to choose a suitable prediction model. NetNGlyc is the earliest glycosylation prediction tool; the early data were relatively incomplete, and so its MCC is the lowest. GPP is also an early prediction tool. GlycoPP targets prokaryotes and does not perform well for eukaryotes such as humans and mice. Glycosylation sites differ between species. GlycoEP provides multiple prediction models, and the differences between the models are very large. SPRINT-Gly establishes prediction models for mice and humans, whereas N-GlyDE establishes prediction models for humans. SPRINT-Gly and N-GlyDE were released in 2019 and perform better than the other tools.

IV. CONCLUSION
N-GlycoGo is based on an ensemble learning model. It uses information from human and mouse N-linked glycosylation sites and considers sequence-based, structure-based, and function-based features, for a total of 11 feature encodings. The best model integrates this information. First, Binary, AAindex, AAC, CKSAAP, Kmer, PC-PseAAC, SC-PseAAC, Motif, RSA/ASA, SS, and SignalP are encoded from amino acid fragments with a window size of 21. The results were then predicted using various integrated models through ten-fold cross-validation, and XGBoost performed best with an MCC of 0.968. On the independent set compiled from dbPTM and O-GlycBase, which differs from the training data set, XGBoost also reaches an MCC of 0.397 and 0.719 in human and mouse, respectively. Therefore, XGBoost is used as the basic model for N-GlycoGo prediction.
The independent set was also submitted to the existing glycosylation site prediction websites, including NetNGlyc, GPP, GlycoEP, GlycoPP, SPRINT-Gly, and N-GlyDE. In the evaluation on the human independent set, the MCC values of N-GlycoGo and GlycoEP_A_BPP are tied for first place. In the evaluation on the mouse independent set, all other tools performed worse than N-GlycoGo except GlycoEP's BPP, and the MCC of N-GlycoGo was much higher than the average MCC of the different GlycoEP models.
N-GlycoGo was developed by rigorously analyzing and integrating the best architecture at each step of glycosylation site prediction in human and mouse. It will help researchers save considerable time and predict sites accurately.

YEN-WEI CHU received the Ph.D. degree in computer science from National Chiao Tung University, Taiwan, in 2006. His lab takes the technologies of data mining, machine learning, and artificial intelligence as its core algorithms to establish a variety of intelligent decision-making systems for different issues, which are an important part of industry 4.0 and the Internet of Things. He is currently a Professor with the Institute of Genomics and Bioinformatics and a joint appointment Professor with the Institute of Molecular Biology, National Chung Hsing University, Taiwan. He mainly focuses on building learning models of humanoid intelligence, covering the fields of bioinformatics, medical science, agriculture, food science, business, and astronomy. His research interests include bioinformatics algorithms, computational epigenetics, artificial intelligence, and intelligent systems.