Mal-Light: Enhancing Lysine Malonylation Sites Prediction Problem Using Evolutionary-based Features

Post-translational modification (PTM) is an important biological process with a tremendous impact on the function of proteins in both eukaryotic and prokaryotic cells. During the past decades, a wide range of PTMs has been identified. Among them, malonylation is a recently identified PTM that plays a vital role in a wide range of biological interactions. In particular, this modification plays a potential role in energy metabolism in different species, including Homo sapiens. The identification of PTM sites using experimental methods is time-consuming and costly. Hence, there is a demand for fast and cost-effective computational methods. In this study, we propose a new machine learning method, called Mal-Light, to address this problem. To build this model, we extract local evolutionary-based information according to the interaction of neighboring amino acids using a bi-peptide based method. We then use Light Gradient Boosting (LightGBM) as our classifier to predict malonylation sites. Our results demonstrate that Mal-Light significantly improves malonylation site prediction performance compared to previous studies found in the literature. Using Mal-Light, we achieve a Matthews correlation coefficient (MCC) of 0.74 and 0.60, Accuracy of 86.66% and 79.51%, Sensitivity of 78.26% and 67.27%, and Specificity of 95.05% and 91.75% for Homo sapiens and Mus musculus proteins, respectively. Mal-Light is implemented as an online predictor, which is publicly available at http://brl.uiu.ac.bd/MalLight/


I. INTRODUCTION
Post-translational modifications (PTMs) are key mechanisms for regulating numerous biological processes that are associated with the control of various cellular activities and diseases [1]-[4]. PTMs occur after proteins have been translated from their mRNA sequences [5], [6]. PTMs are major components of the biological processes that expand the genetic code and regulate cellular physiology. So far, more than 620 varieties of PTMs [7] have been identified. Among the 20 types of natural amino acids, lysine is one of the residues most widely modified through PTM [8]. It has been associated with numerous PTMs including glycation [9], succinylation [10], [11], methylation [12], [13], acetylation [14], and sumoylation [15]. Among them, lysine malonylation (Kmal) is a recently identified, evolutionarily conserved PTM type that is associated with several biological processes in both eukaryotic and prokaryotic cells. Lysine malonylation plays a vital role in a wide range of biological interactions [16]. It has also been found in histones, with functions related to gene expression and chromosome configuration. Thus, identification of malonylation sites can provide detailed insights into the functionality of proteins and their biological interactions. The impact of the abundance of malonylated proteins on metabolic pathways, notably those related to fatty acid metabolism, is explained in [17]. In addition, newly identified malonylated sites have been found to be associated with pathological and physiological functions such as the control of appetite and muscle contraction [18], [19].
FIGURE 1. The general architecture of Mal-Light. Sequences were obtained from a public database, features were generated by our bi-peptide based evolutionary feature extraction approach, and the classification algorithm was evaluated using both 10-fold cross-validation and an independent test set.
The associate editor coordinating the review of this manuscript and approving it for publication was Larbi Boubchir.
The foremost techniques for identifying Kmal sites are experimental methods such as mass spectrometry. However, these methods are costly and time-consuming. In recent years, the identification of PTM sites using fast and accurate computational methods has attracted tremendous attention [20]. Various bioinformatics techniques have been suggested to identify PTM sites in protein sequences [11], [21]-[29]. Among those studies, a five-step rule was proposed in [30] for designing an efficient computational predictor for such biological problems, and it has been widely cited and followed in other studies [27], [28], [31]-[33]. The five steps are: (1) curate the dataset manually, or construct it in some other valid way, and split it randomly into training and testing sets for the predictor, (2) transform the biological sequences into numerical values to extract the feature vectors, (3) select or develop a proper algorithm for the problem to build the predictor, (4) validate the predictor with statistical performance metrics and evaluate its improvement, and (5) design a user-friendly predictor and deploy it publicly as a web server application. This process is illustrated in Fig. 1 and explained in the subsequent section.
Among computational approaches, predicting malonylated sites with Machine Learning (ML) models has attracted the most attention [25], [34]-[40]. The first computational scheme for predicting Kmal sites from protein sequences, called Mal-Lys, was developed by Xu et al. [35]. They extracted three types of features in Mal-Lys based on position-specific amino acid dehydration, sequence order information, and physicochemical properties. They used maximum relevance minimum redundancy for the feature selection task [36] and a Support Vector Machine (SVM) as the classifier to build Mal-Lys. At the same time, Wang et al. [37] presented an SVM-based classifier named MaloPred to predict malonylation sites in three different species (Homo sapiens, Mus musculus, and Escherichia coli). In a different study, Xiang et al. [38] trained an SVM model using a new computational method based on the Pseudo Amino Acid Composition (PseAAC) scheme to extract features. Their study also validated the diverse pathways and biological processes in several species. In another study, Zhang et al. [39] extracted the characteristics and key patterns from the residue sequences of Kmal sites using 11 different feature encoding methods. From these, they identified the optimized features and used Light Gradient Boosting Machine (LightGBM) as their classifier to predict Kmal sites for Homo sapiens, Mus musculus, and Escherichia coli samples.
In a different study, Jianhua et al. [11] developed a new predictor, named pSuc-Lys, using the PseAAC feature extraction technique. To build this model, they incorporated a vectorized sequence-coupling model into the general form of PseAAC and used an ensemble random forest technique as their classifier. At the same time, Taherzadeh et al. [40] introduced a new machine-learning approach named SPRINT for the sequence-based prediction of protein-peptide binding sites directly from the protein sequence using a Support Vector Machine. Later on, Taherzadeh et al. [41] also proposed SPRINT-Mal for the Kmal site prediction problem. To build this model, they used both sequence-based and structure-based features with SVM as their classifier. They obtained promising results in predicting the malonylation sites for mouse samples. Most recently, Zhe et al. [42] proposed a new SVM-based method, entitled CKSAAP_FormSite, to address the class imbalance problem in the formylation site prediction task. They applied the composition of k-spaced amino acid pairs (CKSAAP) feature extraction technique to encode each peptide during training.
Despite all the efforts that have been made so far, Kmal prediction accuracy remains limited. In this paper, we propose a new model called Mal-Light, based on a bi-peptide based evolutionary feature extraction strategy, to enhance malonylation site prediction performance [43], [44]. We then investigated the performance of 12 different classifiers on our extracted features to identify the best one to build Mal-Light. Among these classifiers, Light Gradient Boosting (LightGBM) obtained the best results. As a result, we use this classifier to build Mal-Light. This process is shown in Fig. 1 and explained in detail in the subsequent section. Our main contribution is to investigate a wide range of models that obtained promising results in other studies but have never been used for the malonylation site prediction problem, in order to enhance the prediction performance. We compared the prediction results of Mal-Light with those of MaloPred [37] and kmal-sp [39].
On our independent test set, we obtained a Matthews correlation coefficient (MCC) of 0.74 and 0.60, Accuracy (ACC) of 86.66% and 79.51%, Sensitivity (SN) of 78.26% and 67.27%, and Specificity (SP) of 95.05% and 91.75% for the Homo sapiens (human) and Mus musculus (mouse) samples, respectively. Mal-Light obtained promising results, exceeding all the preceding predictors.

A. BENCHMARK DATASET
For the experimental analysis, we use malonylation data from the Protein Lysine Modification Database (PLMD) [45]. This dataset contains 9,584 malonylation and 677,865 non-malonylation sites in 3,429 proteins belonging mainly to six species. Here we focus on Homo sapiens (5,013 sites in 1,841 proteins) and Mus musculus (4,390 sites in 1,466 proteins), as the number of samples for the remaining species is extremely low. The number of samples belonging to each group and species is shown in Table 1. The residue on which malonylation occurs is the amino acid lysine (one-letter notation K). To transform a protein into peptide sequences, the lysine residue is kept in the middle of a window of size 2ξ + 1, where ξ is the number of upstream and of downstream residues. In the proposed model, the window size is 21 (ξ = 10). The optimal window size in the specified range was found by observing the performance of the LightGBM classifier on features extracted using the amino acid composition technique. To ensure a uniform length upstream and downstream, a dummy residue (X) is added at either end when required (for lysines near the N-terminus or C-terminus that have fewer than 10 neighboring amino acids on one side). After that, we removed duplicated sites and extracted the unique positive and unique negative peptide sequences for all species as well as for Homo sapiens (human) and Mus musculus (mouse). In the next step, to reduce homologous redundancy among the sequences, we use CD-HIT [46], which has been widely used for this task. Among the peptide sequences, the negatives far outnumber the positives. As a result, we applied CD-HIT [46] to the negative sequences only and left the positive sequences untouched to avoid losing the limited positive samples.
Applying CD-HIT [46] to the positive sequences as well would widen the gap between the numbers of positive and negative sequences even further. CD-HIT reduced the negative sites with a similarity cut-off of 40%. We then generated PSSMs for our 9,584 positive and 14,972 negative samples for all the species. In addition, the dataset contains 5,013 positive and 12,869 negative samples for human and 4,390 positive and 10,152 negative samples for mouse. To measure the actual effectiveness of our proposed model, we generate independent test data from our original data that is unseen during training. To this end, we randomly assign 90% of the data to training and 10% to the independent test set, in the same way for all species together as well as for human and mouse.
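The windowing and splitting steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names `extract_windows` and `split_train_test`, the 1-based position convention, and the fixed random seed are assumptions introduced here.

```python
import random


def extract_windows(sequence, xi=10, pad="X"):
    """Return (1-based position, peptide) for every lysine in `sequence`.

    Each peptide has 2*xi + 1 residues with the lysine in the centre.
    Missing neighbours near either terminus are filled with `pad`.
    """
    padded = pad * xi + sequence + pad * xi
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i of the original sequence sits at index i + xi of
            # the padded string, so the window starts at index i.
            windows.append((i + 1, padded[i:i + 2 * xi + 1]))
    return windows


def split_train_test(samples, test_fraction=0.1, seed=0):
    """Randomly hold out `test_fraction` of the samples as a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    for position, peptide in extract_windows("MKAAK", xi=2):
        print(position, peptide)
```

With ξ = 10 the same function yields the 21-residue peptides used in this study; the small ξ in the usage example only keeps the printed output short.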

B. FEATURE EXTRACTION
Biological data are usually represented as sequence strings. These strings normally consist of one-letter notations in which each letter represents an amino acid for proteins or a nucleotide for DNA. The string data have to be converted into numerical values to present the biological instances to the classifier. This transformation, which is called feature extraction, can be accomplished in many ways [44], [47]-[52]. However, the numerical values cannot always preserve all the information carried in the letter strings. To maximize the information retained from the string, different feature extraction techniques have been introduced in the literature [43], [44], [47], [48], [50], [51], [53]. Most of these studies introduced sequential features extracted from evolutionary-based and structure-based information [43], [44], [50], [52], [53]. Some of these studies also incorporate evolutionary-based and physicochemical-based information simultaneously [47], [51]. Evolutionary-based features are the most widely used and provide information on how proteins and peptides evolved or changed through mutation. Structural features provide information on the local structure of the proteins, extracted from the predicted secondary structure. Similarly, physicochemical-based features are extracted based on different physical and chemical properties of the amino acids along the protein and peptide sequences. However, in almost all cases, the proposed feature extraction methods fail to extract local discriminatory information based on the interaction of the amino acids along the protein or peptide sequences. Adopting feature extraction techniques that do not preserve important discriminatory information leads to low prediction performance in the classification task.
In this study, sequential evolutionary features were used to represent each malonylated and non-malonylated lysine residue. The 10 upstream and 10 downstream amino acids of each lysine were selected to extract features, as this window obtained the best results compared to other window sizes. When a lysine residue did not have 10 amino acids upstream (near the N-terminus) or downstream (near the C-terminus), the missing positions were filled in to complete the window. This process is shown in detail in Fig. 2. The sequence segment $P_{\xi}(\mathbb{K})$ consists of 10 upstream and 10 downstream residues in addition to the central lysine amino acid (K).
A peptide sample can be expressed as
$$P_{\xi}(\mathbb{K}) = R_{-\xi} R_{-(\xi-1)} \cdots R_{-2} R_{-1} \, \mathbb{K} \, R_{+1} R_{+2} \cdots R_{+(\xi-1)} R_{+\xi} \tag{1}$$
where $\xi$ is an integer, $\mathbb{K}$ indicates the amino acid lysine (K), $R_{-\xi}$ denotes the $\xi$-th upstream residue, and $R_{+\xi}$ denotes the $\xi$-th downstream residue of the peptide sample. The entire peptide is a substring of a protein sequence and contains $2\xi + 1$ residues. Each peptide sample therefore falls into one of two categories:
$$P_{\xi}(\mathbb{K}) \in \begin{cases} P^{+}_{\xi}(\mathbb{K}), & \text{if its center is a malonylation site} \\ P^{-}_{\xi}(\mathbb{K}), & \text{otherwise} \end{cases} \tag{2}$$
where $P^{+}_{\xi}(\mathbb{K})$ denotes a positive (malonylated) segment, $P^{-}_{\xi}(\mathbb{K})$ denotes a negative (non-malonylated) segment, and $\in$ denotes set membership.
The benchmark dataset can be written as
$$S_{\xi}(\mathbb{K}) = S^{+}_{\xi}(\mathbb{K}) \cup S^{-}_{\xi}(\mathbb{K}) \tag{3}$$
where $S^{+}_{\xi}(\mathbb{K})$ contains the malonylated segments $P^{+}_{\xi}(\mathbb{K})$, $S^{-}_{\xi}(\mathbb{K})$ contains the non-malonylated segments $P^{-}_{\xi}(\mathbb{K})$, and $\cup$ is the set union operation.

C. BI-PEPTIDE BASED EVOLUTIONARY FEATURE
The bi-peptide based evolutionary feature is the feature extraction technique used here for the prediction of lysine malonylation sites. This technique is a modification of the original sequential evolutionary feature extraction technique introduced in [54], [55]. It has been shown to be an effective method for feature extraction in similar studies [43], [44], [49], [50]. We extract this feature directly from the Position Specific Scoring Matrix (PSSM), which contains important evolutionary information about the substitution of amino acids through mutation. Mutation is the process of sudden alterations, insertions, deletions, or rearrangements of amino acids, which results in diverse characteristics in subsequent generations. Such evolutionary changes are sometimes beneficial but sometimes cause adverse effects. Alignment is the best way to find how similar peptide sequences are. A widely used tool named BLAST (Basic Local Alignment Search Tool) can be used to align a query sequence against a database. PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) [56] builds on BLAST to create the PSSM iteratively, here with an E-value cutoff of $10^{-3}$ (0.001).
The following procedure is used to construct a feature vector from a dataset.
FIGURE 2. Schematic representation of a lysine residue and its surrounding amino acids. Figure 2.A: a lysine residue with 10 upstream and 10 downstream amino acids. Figure 2.B: dummy residues added at the N-terminus and C-terminus to complete the window for lysines with fewer than 10 neighboring amino acids on each side.
i) A peptide query sequence is denoted $P$ and can be written as
$$P = R_{1} R_{2} R_{3} \cdots R_{L} \tag{4}$$
Its PSSM is an $L \times 20$ matrix, where $L$ is the length of $P$ and the 20 columns correspond to the 20 amino acids:
$$P_{\mathrm{PSSM}} = \begin{bmatrix} E_{1 \to 1} & E_{1 \to 2} & \cdots & E_{1 \to 20} \\ E_{2 \to 1} & E_{2 \to 2} & \cdots & E_{2 \to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L \to 1} & E_{L \to 2} & \cdots & E_{L \to 20} \end{bmatrix} \tag{5}$$
Here, the 20 columns follow the alphabetical order of the 20 amino acids, and $E_{i \to j}$ is the propensity of the residue at position $i$ of the sequence to mutate to amino acid type $j$ during the evolution process.
ii) From Equation (5), a new matrix is derived using the mean and the standard deviation of the scores.
iii) The above matrix is then transformed into a vector of 210 elements, denoted $P_{evo}$.
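The exact formulas for steps ii) and iii) are not given in the text, so the following Python sketch rests on two explicit assumptions rather than on the authors' equations: each PSSM row is standardized to zero mean and unit standard deviation, and the 210 values are the upper triangle, diagonal included, of the symmetric 20 x 20 matrix of column-by-column inner products, which is one construction that yields 20 * 21 / 2 = 210 elements. The function names are hypothetical.

```python
import math


def standardize_rows(pssm):
    """Z-score each row of an L x 20 PSSM (assumed normalization)."""
    standardized = []
    for row in pssm:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        # A constant row has zero spread; map it to zeros instead of
        # dividing by zero.
        standardized.append([(v - mean) / sd if sd > 0 else 0.0 for v in row])
    return standardized


def pairwise_column_features(matrix):
    """Upper triangle of M^T M: one value per unordered column pair."""
    n_cols = len(matrix[0])
    features = []
    for a in range(n_cols):
        for b in range(a, n_cols):
            features.append(sum(row[a] * row[b] for row in matrix))
    return features


if __name__ == "__main__":
    # Toy 21 x 20 integer matrix standing in for a real PSI-BLAST PSSM.
    toy_pssm = [[(i * j) % 7 - 3 for j in range(20)] for i in range(21)]
    vector = pairwise_column_features(standardize_rows(toy_pssm))
    print(len(vector))
```

Because every feature combines two amino acid columns, the vector captures pairwise substitution patterns, and its length does not depend on the peptide length.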

D. ADDRESSING IMBALANCED DATASET ISSUE
After cross-checking the sites against the original sequences, the ratio between the malonylation (positive) sites and the non-malonylation (negative) sites remains highly imbalanced: the number of non-malonylation sites is much larger than that of malonylation sites. With such proportions, the predictor can become strongly biased towards negative samples. It has been comprehensively studied in the machine learning literature that bias-free classification is difficult to achieve when the training data are imbalanced. To address this issue, a number of balancing strategies have been proposed [42], [58], [63]. One option is to downsample the data, but this can dramatically reduce the number of available samples. Instead of excluding samples, we upsample at an early stage, as was done in [60], [64], [65], so that no information is discarded for the predictor. To handle the class imbalance problem, some studies adjusted the learning parameters of their models. For example, [42], [60], [62] adjusted the learning parameters of the Support Vector Machine (SVM) classifier to deal with imbalanced data.
In some other studies, the K-Nearest Neighbors (KNN) strategy and the Neighborhood Cleaning Rule (NCR) were adopted to balance the data [59], [61], [63]. These studies tuned the value of k over several thresholds across iterations when calculating the Euclidean distances. In this study, we balance our dataset using oversampling with synthetic data construction, for which the synthetic data must be very similar to the original data. To ensure that the variation is small, we took the maximum value over all the feature vectors and found that even when this maximum is multiplied by the constant 1.0001 or 1.0005, the new value is very close to the original value. Since multiplying the maximum value results in only a small variation, multiplying the other values of the feature vectors must also generate very small variations [61]-[63], [66]-[68]. This is how we generate our new dataset with a small variation. Accordingly, we multiply the 9,584 positive sites of all species by 1.0001 (giving 19,168 $P^{+}_{\xi}(\mathbb{K})$ sites) and the 5,013 positive sites of Homo sapiens by 1.0003 and by 1.0005 (giving 15,039 $P^{+}_{\xi}(\mathbb{K})$ sites). In addition, we multiply the 4,390 positive sites of Mus musculus by 1.0003 (giving 8,780 $P^{+}_{\xi}(\mathbb{K})$ sites). The number of negative sites is 14,972 for all species, 12,869 for Homo sapiens, and 10,152 for Mus musculus. We then use the Cluster Centroid based Majority Under-sampling Technique (CCMUT) [69] to balance the positive and negative sites in the training data. After this step, the ratio of positive to negative sites in the training data is 1 : 1 (malonylation sites : non-malonylation sites). Note that the positive and negative sites in the test data, for all species together and for the individual species, were left untouched. In this way, we make sure that our balancing does not affect the generality of our results and we avoid overfitting.
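The multiplicative oversampling step can be illustrated with a short sketch. It assumes that each synthetic sample is an element-wise scaled copy of a positive feature vector and that the originals are kept, which matches the reported counts (for example 2 x 9,584 = 19,168 and 3 x 5,013 = 15,039). The function name `oversample_by_scaling` is hypothetical, and the CCMUT under-sampling of the negative class is not shown.

```python
def oversample_by_scaling(positive_vectors, factors):
    """Return the original vectors plus one scaled copy per factor.

    positive_vectors: list of feature vectors (lists of floats).
    factors: constants close to 1, e.g. [1.0001] or [1.0003, 1.0005].
    """
    augmented = [list(vector) for vector in positive_vectors]
    for factor in factors:
        for vector in positive_vectors:
            # Every feature is multiplied by the same constant, so the
            # synthetic sample stays very close to its source sample.
            augmented.append([value * factor for value in vector])
    return augmented


if __name__ == "__main__":
    positives = [[0.5, -1.2, 3.0], [2.0, 0.0, -0.7]]
    human_style = oversample_by_scaling(positives, [1.0003, 1.0005])
    print(len(positives), "->", len(human_style))
```

One factor doubles the positive set and two factors triple it, which is why a different number of constants is used for each dataset.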

E. CLASSIFICATION ALGORITHM
To identify the most effective predictor, we investigated 12 different classifiers that have performed outstandingly on numerous biological problems [39], [43], [57], [60], [70]-[80]. These classifiers are: Extreme Gradient Boosting (XGBoost) [39], Adaptive Boosting (AdaBoost) [43], Support Vector Machine (SVM) [57], [60], Random Forest (RF) [70], [71], Light Gradient Boosting Machine (LightGBM) [72], [73], Linear Discriminant Analysis (LDA) [74], Quadratic Discriminant Analysis (QDA) [75], Bootstrap Aggregating (Bagging) [76], Decision Tree (DT) [77], Extra-Trees (ET) [78], Gradient Boosting (GB) [79], and Multi-layer Perceptron (MLP) [80], [81]. Finally, we selected LightGBM [73] as our classifier, as it obtained the best results in all respects compared to the other classifiers. The comparison of the results with the other classifiers is provided at https://github.com/Wakiloo7/Mal-Light.
TABLE 3. Performance comparison between our proposed method and MaloPred [37], kmal-sp [39] for predicting the malonylation sites of the individual species (Homo sapiens, Mus musculus) and total species (six species) based on the independent test.
The Light Gradient Boosting Machine (LightGBM) [72], [73] is a gradient boosting framework that uses tree-based learning algorithms. The prefix 'Light' refers to its high computational speed. The algorithm uses little memory and can handle large data. LightGBM [73] is not recommended for small datasets, where it is prone to overfitting. Implementing LightGBM is straightforward. We tune some of its parameters, namely num_leaves, n_estimators, and learning_rate. Here, num_leaves is the maximum number of leaves of each base learner tree, n_estimators is the number of base trees, and learning_rate is the learning rate of boosting.
In this study, the optimized values for these parameters are 31, 40, and 0.1, respectively. In addition, the reset_parameter callback is used to shrink or adapt the learning rate during training.
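A configuration along these lines might look as follows with the LightGBM scikit-learn interface. This is a sketch under assumptions: the three parameter values come from the text, but the exponential decay schedule passed to `reset_parameter`, its 0.99 rate, and the helper names `decayed_learning_rate` and `train_mal_light_classifier` are illustrative choices, since the schedule used by the authors is not stated.

```python
# Parameter values reported in the text.
MAL_LIGHT_PARAMS = {
    "num_leaves": 31,      # maximum leaves per base learner tree
    "n_estimators": 40,    # number of boosted trees
    "learning_rate": 0.1,  # boosting learning rate
}


def decayed_learning_rate(iteration, base=0.1, decay=0.99):
    """Assumed schedule: shrink the learning rate a little every round."""
    return base * (decay ** iteration)


def train_mal_light_classifier(X_train, y_train):
    """Fit a LightGBM classifier with the settings above.

    lightgbm is a third-party package and is imported lazily so that the
    rest of this sketch can be read and run without it.
    """
    import lightgbm as lgb

    classifier = lgb.LGBMClassifier(**MAL_LIGHT_PARAMS)
    classifier.fit(
        X_train,
        y_train,
        # reset_parameter re-evaluates the schedule at each boosting round.
        callbacks=[lgb.reset_parameter(learning_rate=decayed_learning_rate)],
    )
    return classifier
```

In this sketch, `X_train` would hold the 210-element feature vectors and `y_train` the binary malonylation labels.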

F. PERFORMANCE EVALUATION METRICS
In this study, for the computational analysis of our results, we use Accuracy (ACC), Sensitivity (SN), Specificity (SP), the Matthews correlation coefficient (MCC), and the F1-score (F1). All of these metrics have been widely used in the literature [82], [83]. They are defined as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{SN} = \frac{TP}{TP + FN}$$
$$\mathrm{SP} = \frac{TN}{TN + FP}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
$$\mathrm{F1} = \frac{2 \times PR \times RE}{PR + RE}, \quad PR = \frac{TP}{TP + FP}, \quad RE = \frac{TP}{TP + FN}$$
In the above equations, TP (True Positive) is the number of peptide segments correctly classified as malonylated (positive) sites. TN (True Negative) is the number of non-malonylated (negative) sites that are correctly classified. FP (False Positive) is the number of non-malonylated (negative) peptide segments that are incorrectly classified as malonylated (positive), and FN (False Negative) is the number of malonylated (positive) sites that are wrongly predicted as non-malonylated (negative). The MCC value is generally regarded as a representative measure of the overall performance of the system. The F1-score is the harmonic mean of Precision, PR (also called positive predictive value), and Recall, RE (also known as sensitivity). Both FP and FN are taken into account to calculate this score. When the data are imbalanced, the F1-score provides more useful information than accuracy. A strong predictor should perform well on all of the above-mentioned metrics.
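These metrics follow their standard definitions and can be computed directly from the four confusion-matrix counts. The helper below is a minimal sketch; the function name is hypothetical, and it assumes non-degenerate counts so that no denominator other than the MCC one is zero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, SN, SP, F1 and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when its denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "ACC": accuracy,
        "SN": sensitivity,
        "SP": specificity,
        "F1": f1,
        "MCC": mcc,
    }


if __name__ == "__main__":
    for name, value in classification_metrics(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.4f}")
```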

III. RESULTS AND DISCUSSION
Any predictor that aims to predict malonylated sites needs a measure of effectiveness to show how well it performs. For the purpose of this study, we examine five statistical performance metrics for Mal-Light, namely accuracy, sensitivity, specificity, F1-score, and the Matthews correlation coefficient [21]-[23], [49], [82], which have been extensively used in the literature. The overall performance of Mal-Light in predicting malonylated residues is presented in terms of these five metrics.

A. ANALYSIS OF THE RESULTS FOR DIFFERENT SPECIES
Here, we report the malonylation site prediction performance for all six species specified in Table 1, using the dataset collected from PLMD [45]. As explained in the previous section, we applied 12 machine learning algorithms to all species together and to the separate species, and trained them using 10-fold cross-validation. Among all these algorithms, XGBoost [39], SVM [57], [60], LightGBM [73], GB [79], and MLP [80] obtained the best results. Among these classifiers, LightGBM [73] obtained the best results for both the Homo sapiens and Mus musculus species. Mal-Light was trained on every species, and it was trained best on Homo sapiens. As shown in Table 2, Mal-Light performs best on Homo sapiens, followed by all species together (six species), and then Mus musculus.
To investigate the generality of our model and compare our results with those reported in previous studies, we run Mal-Light on the independent test set as well. Accordingly, we train Mal-Light using the training data and apply it to the independent test dataset. As shown in Fig. 3, LightGBM obtained better results on average than the other classifiers. This result is repeated in Fig. 4 for the independent test set, which confirms the results reported in Fig. 3. The consistent results achieved for both 10-fold cross-validation and the independent test set demonstrate the generality of using LightGBM as the classifier to build Mal-Light.
[Figure caption fragment: comparison with MaloPred [37] and kmal-sp [39] for H. sapiens, M. musculus, and all species together (six species), respectively.]
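For readers unfamiliar with the evaluation protocol, 10-fold cross-validation can be sketched as below. This is a generic, unstratified illustration with a hypothetical helper name and an arbitrary seed, not the exact splitting procedure used for Mal-Light.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for fold_number, (train, test) in enumerate(k_fold_indices(25), start=1):
        print(fold_number, len(train), len(test))
```

Each sample appears in exactly one test fold, so the averaged metrics use every training sample once for validation.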

B. PERFORMANCE COMPARISON WITH OTHER EXISTING METHODS
Malonylation was discovered only a few years ago. Due to its novelty, to the best of our knowledge, there are only four main tools for predicting malonylation sites. These are Mal-Lys [35], which is trained solely on Mus musculus data; SPRINT-Mal [41], which predicts malonylation sites only for Homo sapiens and Mus musculus; MaloPred [37], which is designed to predict malonylation sites for three species (Homo sapiens, Mus musculus, and Escherichia coli); and kmal-sp [39], which is designed for the same three species. Considering our targeted species, we compared Mal-Light with two of those predictors, namely MaloPred [37] and kmal-sp [39], which attained the best performance and have online predictors. For the comparison, we manually submitted all the peptide sequences to the web servers and retrieved their predictions for the performance assessment. It is worth noting that the MaloPred [37] and kmal-sp [39] web servers were pre-trained with some of the peptide sequences that are used in this study as the independent test set. In fact, those studies used all of their data to train their models and then used 10-fold cross-validation or jackknife cross-validation to evaluate them. Therefore, their results on our independent test set, which is held out from the whole data, may be overestimated. In other words, the results reported for those studies on the independent test set are likely higher than they would be on truly unseen data. Despite this, our method was able to outperform even those overestimated results.
For this reason, we also ran the classifiers used in those studies, as well as other classification algorithms, on our own training data for some of the species, and compared them using the same method and test dataset. Our results compared to MaloPred [37] and kmal-sp [39] for Homo sapiens and Mus musculus are shown in Table 3. The results presented in Table 3 demonstrate that Mal-Light achieves better performance than MaloPred [37] and kmal-sp [39]. For example, SP, F1-score, ACC, and MCC are improved by 12.65%, 0.03, 3.96%, and 0.09, respectively, compared to MaloPred for the human samples. Also, SP, ACC, and MCC are 8.05%, 0.66%, and 0.02 better, respectively, compared to kmal-sp for the human samples. In addition, for the mouse samples, SP increased by 12.05% and 8.05% compared to MaloPred [37] and kmal-sp [39], respectively. Moreover, a t-test demonstrates the statistical significance of the improvement reported in this study over the previous studies (p-value = 0.047). It is also important to note that Mal-Light achieves an ACC of 82.36% when predicting malonylation sites for all the data together (consisting of samples belonging to 6 species). These results demonstrate the effectiveness of Mal-Light compared to the previous methods proposed in the literature to predict malonylation sites. We also plot the ROC curves for 10-fold cross-validation and for the independent test, shown in Fig. 3 and Fig. 4, respectively. These figures compare the overall performance between the species. In addition, Fig. 5 visualizes the comparison of Mal-Light with MaloPred and kmal-sp for each species, and Fig. 6 presents the corresponding bar plots with error bars.

C. IDENTIFYING THE MOST EFFECTIVE FEATURES TO BUILD MAL-LIGHT
Here we also conduct a comprehensive study to investigate the impact of our extracted features on the malonylation site prediction task. A common approach is to eliminate features one at a time and observe their relative importance. The result is shown in Fig. 7, which presents the impact of the 15 most important features used to build Mal-Light for the different species. The precision-recall curves for our experiments across all the species are illustrated in Fig. 8. In addition, the ROC curves for cross-validation and for the independent test set, shown in Fig. 3 and Fig. 4, compare the overall performance between the species. Furthermore, Fig. 5 compares the per-species performance of Mal-Light with that of MaloPred [37] and kmal-sp [39] for the same species, and Fig. 6 shows the corresponding bar plots with error bars.

IV. CONCLUSION
In this study, we proposed a new predictor named Mal-Light, which uses the PSSM in a different way to predict malonylation sites. Mal-Light incorporates the concept of bi-peptides to extract local features from the PSSM. To build our model, we took an oversampling approach with synthetic data construction, and the resulting data were given as input to the LightGBM classifier for predicting malonylation sites. Our results demonstrate that Mal-Light achieves strong performance across different species. They also demonstrate that Mal-Light is able to outperform previous studies found in the literature for predicting malonylation sites under different evaluation measures. In our future studies, we aim to investigate different window sizes along with new kinds of evolutionary and structure-based features to further enhance the prediction of malonylation, one of the most important PTMs.

ACKNOWLEDGMENT
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.