DeepKcrot: A Deep-Learning Architecture for General and Species-Specific Lysine Crotonylation Site Prediction

Lysine crotonylation (Kcrot), as a post-translational modification (PTM) originally identified at histone proteins, is involved in diverse biological processes. Several conventional machine-learning (ML) predictors were developed based on the Kcrot sites from histone proteins. Recently, thousands of Kcrot sites have been experimentally verified on non-histone proteins from multiple species. Accordingly, a few predictors have been developed for predicting the Krot sites for specific organisms (i.e. humans and papaya). Nevertheless, there is a lack of research on the comparison of the crotonylomes of different organisms. Here, we collected around 20,000 Kcrot sites experimentally identified from four different species as the benchmark data set. We present the deep-learning (DL) architecture dubbed DeepKcrot for predicting Kcrot sites on the proteomes across various species. DeepKcrot includes species-specific and general classifiers using a convolutional neural network with the word embedding (CNNWE) encoding approach. CNNWE performs better than both the traditional ML-based and other DL-based classifiers in terms of ten-fold cross-validation and independent test, independent of the size of the training set. Additionally, cross-species performance for each species-specific predictor is not as good as the self-species performance whereas the cross-species performance generally increases with the size of the training dataset. Moreover, the predictors developed based on the non-histone Kcrot sites are unsuccessful for the histone Kcrot prediction, suggesting that the Kcrot-containing peptides from non-histone and histone proteins have significantly different characteristics and data integration is required. Overall, DeepKcrot is an efficient prediction tool and freely available at http://www.bioinfogo.org/deepkcrot.


I. INTRODUCTION
Lysine crotonylation (Kcrot) is a conserved type of PTMs and it was originally found on histone proteins [1]. Histone crotonylation affects chromatin structure and gene expression [1]- [4]. Recently, it has been discovered on non-histone proteins and involved in various cellular activities [5], [6]. The rapid progress in the development of the state-of-the-art techniques led to the identification of thousands of Kcrot sites from different species through affinity enrichment followed by high-throughput mass spectrometry. 10,163 Kcrot sites on 2445 non-histone proteins were determined from human A549 cells [7] and 2696 Kcrot sites on The associate editor coordinating the review of this manuscript and approving it for publication was Liangxiu Han . 1024 non-histone proteins were identified from the human H1299 cell line [8]. Besides, 5995/1265/2044 different Kcrot sites were experimentally verified from Carica papaya L. (papaya)/Oryza sativa L. japonica (rice)/Nicotiana tabacum (tabacum), respectively [9]- [11]. Recently, CDYL-regulated crotonylome was investigated in Hela cell lines [12].
To understand and elucidate modification kinetics and molecular mechanisms of lysine crotonylation, a fundamental but important step is to accurately predict the crotonylation sites. Currently, five in-silico approaches were developed based on the histone Kcrot sites [13]- [17]. Although these algorithms have made great contributions to the Krot prediction, they fail to identify non-histone Kcrot sites [18]. Additionally, we collected the papaya Kcrot sites and developed a panel of classifiers for the Kcrot prediction [18].
Among the traditional ML approaches, the random forest (RF) model with the Enhanced Grouped Amino Acid Composition (EGAAC) encoding feature, dubbed RF EGAAC , had the best performance [18]. Additionally, the onedimensional Convolutional Neural Network (CNN) with the word-embedding (WE) encoding approach, named CNN WE , showed superior performance in all the models [18]. Moreover, Wang and coworkers used our collected papaya Kcrot sites and a limited number (167) of mammalian Kcrot sites for the construction of Kcrot predictors using the RF and SVM (support vector machine) architectures with the combination of different features [19]. Lv et al. [20] developed a DL model called Deep-Kcr based on experimentally verified human crotonylome [12]. These developed models were mainly based on either the crotonylome of the specific organism (e.g. papaya or humans) or a limited number of proteins (e.g. histones). With the identification of thousands of Kcrot sites from various species, it is of interest to study the diversity of crotonylomes across the different organisms and compare the performance of the developed methods and investigate whether any other model with better performance than the previously developed models.
In this study, we constructed the Long Short-Term Memory (LSTM) model and compared it with our previously developed models including RF EGAAC and CNN WE models. We found that the CNN WE model still showed the best performance. Additionally, the CNN WE model compared favourably to the reported model Deep-Kcr. Moreover, we constructed DeepKcrot based on the CNN WE architecture that included four orgasm-specific predictors and a general predictor. We find that cross-species performances for species-specific CNN WE predictors are not as good as the self-species performance. The general CNN WE predictor based on the integration of the training data from different species shows superior performance to the species-specific predictors except for one organism. Overall, the general CNN WE models have excellent performance for predicting Kcrot sites on proteomes across different species.

A. DATA COLLECTION AND PREPROCESSING
We collected 10,702/1265/2044/5995 Kcrot sites on nonhistone proteins from human/rice/tabacum/papaya, respectively [9]- [11]. We took the human species as an example to describe the data preprocessing. To prepare the benchmark data sets with high confidence for training and testing, we referred to the procedure established by Chen et al. [21], [22].
The 10,702 Kcrot sites from the 2836 human proteins were considered as positive sites, and the remaining lysine residues (775,123) on the same proteins were deemed as negative sites. The 2836 proteins with sequence identities >30% were classified into 2064 clusters using CD-HIT [23]. In each cluster, the protein with the largest number of Kcrot sites remained as the representative, in which the Kcrot sites were considered as positive sites and the rest lysine sites were taken as negative sites. Note that the lysine sites in the representative were removed if the aligned counterparts from other members of the same cluster can be crotonylated. According to our previous study [18], the optimal sequence window for model construction was 29. Accordingly, the dataset contained 8,170 positive sites and 76,673 negative sites from 2064 representatives. The representatives were randomly divided into two groups: 4/5 (1651) for cross-validation and the rest 1/5 (413) for an independent test. Finally, the cross-validation data set contained 6687 positives and 67,106 negatives, and the independent test dataset contained 1483 positives and 16,497 negatives ( Figure 1). The same data preprocessing was performed for papaya, rice and tabacum, respectively ( Figure 1).

B. CONVENTIONAL MACHINE LEARNING ALGORITHMS
The RF algorithm was selected and trained with the EGAAC feature by randomly generating 1600 decision trees. In the EGAAC feature, the types of amino acids were categorized into five groups (g1: GAVLMI, g2:FYW, g3: KRH, g4: DE and g5: STCPNQ) according to their physicochemical properties and the frequencies of the groups were calculated in the window of fixed length (the default value is 5) continuously sliding from the N-to C-terminal of each peptide sequence.

C. DEEP LEARNING ALGORITHMS
We constructed a DL framework based on a one-dimensional CNN with the WE encoding approach [18]. Figure 2 showed that this framework included the five layers: the input layer, the embedding layer, the convolutional layer, the fully connected layer and the output layer. These layers were described in our previous study [18]. In the embedding layer, each type of amino acid was converted into a predefined certain dimension word vector. The parameters in the vectors were updated with subsequent layers during the learning process under the supervision of a class label. We investigated the effect of the dimension size on the prediction performance (Table 1). Within the range of the dimension from three to seven, the prediction performance increased from three to five and reach the plateau starting from five. Therefore, we chose the dimension of the word vector as five.
We also constructed the Long Short-Term Memory (LSTM) model with the WE encoding approach. This model contained five layers ( Figure 3).
1) The Input Layer: Each peptide segment is converted into an integer vector with the NUM encoding approach, where each type of amino acid residues was mapped to a different integer [24].
2) The Word Embedding Layer: Each integer of the vector from the input layer is encoded into a predefined five-dimension word vector.
3) The LSTM Layer: Each of the word vectors is input sequentially into the LSTM cell that contained 32 hidden neuron units.   The architecture of CNN WE . It contained five layers. The input layer received a peptide sequence of 29 residues with K in the center. In the embedding layer, each residue of the sequence was converted into a five-dimensional word vector. In the convolution layer, 29 five-dimension word vectors were input into the CNN cell that contained 128 hidden neuron units. In the fully connected layer, 128 neuron units were built in which the ReLU was chosen for its activation function. The last layer included a single unit outputting the prediction scores.

4)
The Dense Layer: It contains a single dense sublayer that has 16 neurons with the ReLU activation function. 5) The Output Layer: This layer has only one neuron activated by sigmoid function, outputting the probability of the Kcrot modification. The parameters in the DL models were trained and optimized using the Adam algorithm. The dropout rate was set as 0.5 to avoid overfitting. We set the learning rate as 0.001, determined using the maximum number of epochs as 500. The early-stopping strategy was applied, where the training process was stopped early if the performance did not improve within 50 epochs.

D. PERFORMANCE ASSESSMENT OF THE PREDICTORS
Four measurements of accuracy (Acc), sensitivity (Sn), specificity (Sp) and Mathew Correlation Coefficient (MCC)  were calculated. They are defined as: where TP/TN is the number of the correctly predicted Kcrot/K sites, separately, whereas FP/FN is the number of the Kcrot/K sites incorrectly predicted, respectively. For each algorithm, a ten-fold cross-validation test was performed. The ROC curves were illustrated for Sn vs. 1-Sp scores and the AUC values were calculated. The area under the ROC curve with <10% false-positive rate (AUC01) was considered because it reflects the performance of the predictor in a low false-positive rate, which is significant in a practical application.

III. RUSULT AND DISCUSSTIONS A. THE CNN APPROACH WITH WORD EMBEDDING SHOWED SUPERIOR PERFORMANCE
We collected from literature 10,702/1265/2044/5995 nonredundant Kcrot sites experimentally verified from human/ rice/tabacum/papaya, respectively [7]- [11]. For each organism, we first eliminated homologous Kcrot sites and conducted the species-specific dataset. The species-specific dataset was further separated into two groups: 4/5 for tenfold cross-validation and the rest 1/5 for an independent test (see Methods for details, Figure 1). For instance, the cross-validation dataset for the human species contained 6687 positives and 67,106 negatives while the independent dataset covered 1483 positives and 16,497 negatives. As the largest dataset is derived from the human species, our study focused on the human species followed by the expansion to other species.
Many computational approaches have been developed for the prediction of PTM sites. They are generally based on different ML algorithms combined with various pre-defined features encoded from peptide sequences [25]. The RF algorithm is widely applied to the PTM prediction as it is robust and insensitive to data imbalance [24]. We ever compared the effect of the imbalanced dataset on the potential overfitting of the classifiers and found that the RF model constructed using an imbalanced training dataset had a similar performance to that built using a balanced training dataset [24]. According to our previous study on the papaya proteome, we constructed RF-based predictors with the EGAAC encoding scheme, dubbed RF EGAAC that showed the best performance in the RF-based models [18].
Deep learning algorithms have recently been applied to the field of modification prediction and have shown superior performance to traditional ML algorithms [21], [26]. We ever applied the CNN models for the prediction of the papaya . Performance comparison of the Kcrot predictors. The performances of different predictors constructed for human species were compared in terms of AUC (A) and AUC01 (B), respectively, for ten-fold cross-validation. AUC (C) and AUC01 (D) curves were also generated using the independent test. P values were calculated using a paired Student's t-test. A detailed performance comparison using different measurements is provided in Table 2.
Kcrot sites and the CNN model with the word embedding encoding approach compared favourably to other CNN-based approaches [18]. Additionally, we developed here the classifier based on long short-term memory (LSTM) with word embedding named LSTM WE , which was previously constructed to predict Cysteine S-Sulphenylation Sites and had better performance than CNN WE [24].
Among the three models, CNN WE performed the best in the prediction of human Kcrot sites for both the ten-fold crossvalidation and independent test, followed by LSTM WE and RF EGAAC (P < 5.92 × 10 −8 for CNN WE and LSTM WE ; P < 5.64 × 10 −8 for LSTM WE and RF EGAAC ; Figure 4 & Table 2). For instance, the MCC value and AUC value for CNN WE are 0.342 and 0.864 in terms of cross-validation ( Figure 4 & Table 2). As prediction performance at a low false-positive rate is highly useful in practice, we applied AUC01, in which the specificity was determined to be >90%, to the estimation of these predictors. CNN WE again compared favourably to other models in terms of the cross-validation test as well as the independent test ( Figure 4 & Table 2). Because CNN WE showed its superior performance for the Kcrot prediction in two different proteomes (ie, human and papaya [18]), we concluded that CNN WE was the best and robust model for the Kcrot prediction.
To understand the performance of the CNN WE model, we visualized the sample distributions from the outputs of the embedding layer and the last convolutional layer of the human CNN WE model using the t-SNE algorithm [27], based on the independent dataset ( Figure 5A&5B). In the word embedding layer, all the samples were mixed ( Figure 5A), whereas the positive and negative samples were separated after the convolutional operation ( Figure 5B). This comparison indicates that the distinctive features of the positives and negatives were detected by the convolutional layer, and our CNN WE model could produce a deep representation that is more discriminating than the original input sequences.
The word embedding approach is widely applied to the natural language process, in which each word is converted into a low-dimension vector. This approach avoids a sparse vector space and infers the semantic similarity of words. We applied this concept to peptide sequences. Each amino acid was converted into a five-dimension word vector in the embedding layer. Finally, a 20 × 5 matrix was generated after training where every row represented a five-dimensional 49508 VOLUME 9, 2021  The residues were grouped into four major groups: (i) the alkaline residues K and R (red colour), (ii) the amino acids with negative charged side chains D and E (blue colour), (iii) the hydrophobic amino acids F, L, M, I, V, W, Y and A (green colour); (iv) the mainly polar uncharged residues T, Q, S, G, H and N (purple colour).
word vector of the amino acid. Based on the matrix, we investigated the similarity of amino acid residues around the Kcrot sites. The 20 amino acids were hierarchically clustered using Euclidean distance in average linkage ( Figure 5C).
The amino acids were distributed into four major clusters: (i) the alkaline amino acids K and R, (ii) the amino acids with negative charged side chains D and E, (iii) the hydrophobic amino acids F, L, M, I, V, W, Y and A, (iv) the mainly VOLUME 9, 2021 FIGURE 6. Impact of the training set data size on the prediction performance of independent test sets. The AUC (A) and AUC01 (B) curves were generated using five different data sizes: a sixteenth, an eighth, a quarter, a half, and the whole independent dataset from the human species. The whole dataset contained 6,687 positive peptides and 67,106 negative peptides. polar uncharged residues T, Q, S, G, H and N. The special amino acid C and P were separated from these clusters. This clustering is similar to the classification of 20 amino acids according to their physicochemical properties, indicating that physicochemical properties are important as the features of classification and our model is capable of elucidating the significance of the correlation between amino acid properties.

B. ESTIMATION OF THE IMPACT OF DATA SIZE ON PREDICTION ACCURACY
The performance of an ML algorithm is generally sensitive to the size of the training data. We previously constructed the predictors for lysine malonylation and found that the DL algorithm has better performance than the traditional ML approach for the large-sized dataset but it might not be true for the small-sized dataset [21]. Here, we estimated whether the previous observation existed for lysine crotonylation. We selected the predictors and compared their performances constructed based on a sixteenth, eighth, a quarter, a half of, and the whole training dataset and evaluated them using the independent dataset ( Figure 6A&6B). The overall performances of all the approaches increased with the size of the training dataset. Additionally, CNN WE performed better than the traditional algorithms (RF EGAAC ) and LSTM WE in terms of AUC and AUC01 values in the range of the data size between 1/16 (including 418 positives) and the whole (including 6,687 positives) datasets. On the contrary, LSTM WE had an inferior performance compared to RF EGAAC when the data size is the smallest whereas the former compared favourably to the latter when the data size increased. These observations indicate that the performances of the DL models are largely affected by the data size compared with the RF models and CNN WE is a robust and reliable model with high performance.

C. EVALUATION OF SPECIES-SPECIFIC CNN WE MODELS AND THEIR CROSS-SPECIES PERFORMANCES
The lysine crotonylation sites have been investigated from four different species, including human and three plant species (i.e. papaya, rice and tabacum). The number of identified Kcrot sites ranged from 1265 for the rice organism to over 10,000 for humans. We developed the classifier human-specific CNN WE and found that CNN WE outperformed other predictors for both large-sized and small-sized datasets ( Figure 6). As the numbers of positives identified in the three plant species are within this range (Figure 1), we did not repeat our analyses performed above for the three plant species and constructed the CNN WE predictors directly for these species, using the same data processing method as the human dataset ( Figure 1). All the species-specific classifiers have the AUC/AUC01 values larger than 0.838/0.033 using species-specific independent test sets, respectively (Figure 7). The crotonylation is catalyzed by crotonyltransferases. Some of them are evolutionarily conserved such as MOF that is found in both yeast and human [28] while others are not such as CBP and p300 that only exist in mammalian cells. Therefore, the enzymes from different species may have diverse characteristics and the produced Kcrot sites may have different features. To compare these modifications between different species, we interrogated the cross-species performance of CNN WE . The test dataset was the independent test dataset from each species. Expectedly, the crossspecies performances for each species-specific predictor are not as good as the self-species performance in terms of AUC and AUC01 values (Figure 7). For instance, the ricespecific model had the AUC value of 0.858 whereas other specific models only had the AUC value with the range of 0.76 to 0.80 (Figure 7). Additionally, we combined the training datasets from all four species and constructed a general CNN WE classifier. The general predictor based on the large dataset had better performance than cross-species performance. Furthermore, the general predictor compared favourably to a few species-specific predictors (Figure 7). For example, the general classifier had the AUC/AUC01 value of 0.890/0.0377 for the papaya test set while the value reduced to 0.878/0.0355 for the papaya-specific classifier, respectively. However, the general classifier also showed inferior performance to the Tabacum-specific classifier in terms of the Tabacum independent test set. The former had the AUC/AUC01 value of 0.833/0.0319 whereas the latter had the values of 0.838/0.0337, respectively (Figure 7). These suggest that the large size of the Kcrot training dataset has comprehensive coverage of the Kcrot commonality across 49510 VOLUME 9, 2021 FIGURE 7. Comparison of prediction performances for species-specific CNN WE and the general CNN WE classifier. The AUC (A) and AUC01 (B) values were calculated for self-species and cross-species Kcrot prediction using the species-specific CNN WE and the general CNN WE . The former classifier was constructed using the species-specific training dataset while the latter was developed using the combination of species-specific training datasets. The test datasets were the independent test dataset from different species (Figure 1). The species were ordered according to the number of positives used for predictor construction. different species and thus increases cross-species prediction performance, although the exceptional species (i.e. tabacum) exist (Figure 7). These observations may be consistent with the fact that some crotonyltransferases are conserved and others are not.

D. COMPARISON OF CNN WE WITH REPORTED KCROT PREDICTORS
The current classifier CNN WE was constructed using Kcrot sites on non-histone proteins as the benchmark dataset. We estimated whether CNN WE could efficiently predict Kcrot sites of histone proteins. The histone test set contained 169 positive sites and the rest 816 K-containing peptides as negative sites [15]. CNN WE failed to distinguish the positives from the negatives (AUC = 0.623, AUC01 = 0.003). It indicates that the Kcrot sites from non-histone and histone proteins have distinct characteristics. We re-constructed CNN WE by adding 4/5 (i.e. 134 positives) of known histone Kcrot peptides into the training set and considered the rest positives and all negatives (i.e. 35 positives and 816 negatives) as the test set. The new CNN WE model showed improved accuracy for the histone Kcrot prediction (AUC = 0.966, AUC01 = 0.081). The larger AUC value for the histone Kcrot prediction than the AUC value for the non-histone Kcrot suggests that the histone Kcrot sites have common features whereas the features of non-histone Kcrot sites are relatively diverse. In summary, it is necessary to build CNN WE by including Kcrot sites from histone proteins.
We compared our CNN WE architecture with the Deep-Kcr model. As the developers of the Deep-Kcr model shared the training data set and the independent test set through https://github.com/linDing-group/Deep-Kcr, we developed the CNN WE model using the same dataset and evaluated it through ten-fold cross-validation and independent test. The average AUC values were 0.914 ± 0.007 and 0.924 ± 0.002 in terms of ten-fold cross-validation ( Figure 7A) and independent test ( Figure 7B), respectively. Both AUC values are significantly larger than the counterparts of Deep-Kcr (0.885 and 0.859). Therefore, the CNN WE architecture compared favourably to Deep-Kcr.
We further compared our papaya-specific CNN WE model with the models developed by Wang et al.. [19]. All the models were based on the same papaya dataset (3453 positive and 37,134 negative sequences). We separated the data into the cross-validation dataset (2742 positives and 29,676 negatives) and independent test dataset (711 positives and 7458 negatives), whereas Wang et al. generated the training dataset (2548 positives and 2548 negatives) and the testing dataset (669 positives and 6720 negatives) [19]. Please note that the sum of positives in Wang's datasets is 3217, which is smaller than 3453. It may be due to the filtering of the sequences of length less than 31 amino acid compositions or those containing uncertain composition, described by Wang et al. As our independent dataset was larger than Wang's testing dataset, we randomly selected from our independent test dataset 669 positives and 6720 negatives for ten times as the test dataset for the evaluation of our CNN WE model. Wang et al. developed several models with different features and found that the models with the incorporated and selected features had the best performances. According to Table 4 [19], the RF model with the LGBM selection method (LGBM_RF) and the SVM model with the MRMD selection method (MRMD_svm) had the best performances. Therefore, we selected these two models for comparison. Table 3 showed that DeepKcrot had the largest AUC value and had the largest sensitivity when specificity was fixed at 0.72. In summary, the Papaya-specific CNN WE model compared favourably to the models developed by Wang et al..

E. CONSTRUCTION OF THE ONLINE KCROT PREDICTOR
We developed an easy-to-use online tool for the prediction of the Kcrot sites, dubbed DeepKcrot. DeepKcrot contained four species-specific CNN WE predictors and a general CNN WE classifier. The users could select the general model or species-specific model at the input interface and input the query protein sequences directly or upload the sequence file. The prediction results are output in tabular form with five columns: sequence header, position, sequence, prediction score, and prediction result that was colour-coded with at the specificity levels of 80, 90, and 95%, respectively.

IV. CONCLUSION
The common PTM prediction approaches are based on ML that requires experts to pre-define informative features. They have been widely applied to the prediction of lysine crotonylation based on the Kcrot sites on histone proteins. Recently, thousands of Kcrot sites have been identified on non-histone proteins from different species but it is unclear whether lysine crotonylation on these proteins could be effectively predicted. In this study, we compiled a benchmark data set of known Kcrot sites and evaluated the performance of different machine-learning approaches, including deeplearning algorithms. We found that the DL-based classifier CNN WE had the best performance compared with the traditional ML model and the LSTM WE model that showed superior to CNN WE for the prediction of cysteine sulphenylation sites, even for the limited training dataset [24]. It suggests that CNN and LSTM may have distinct characteristics that are feasible to extract different PTM features. Furthermore, these models were compared using different sizes of the training data set and CNN WE again shows the best performance, suggesting its superior performance and robustness. We also compared CNN WE with the recently reported Deep-Kcr model and CNN WE showed better performance. Additionally, the CNN WE model constructed based on non-histone Kcrot sites failed to predict the histone Kcrot sites, suggesting the histone and non-histone Kcrot sites may have different features. Accordingly, we reconstructed the CNN models by including the histone Kcrot sites. Moreover, we constructed four organism-specific CNN WE models and found that crossspecies performances for species-specific CNN WE predictors were not as good as the self-species performances. It suggests that the crotonylome for each organism has its specific features. The general CNN WE predictor based on the integration of the training data from different species showed superior performance to the species-specific predictors except for one organism. Taken together, we developed the first DL architecture DeepKcrot for predicting Kcrot sites on proteomes across various species. The outstanding performance of the DL algorithms in the prediction of Kcrot sites suggests that DL may be applied broadly to predicting other types of PTM sites.