PCSPred_SC: Prediction of Protein Citrullination Sites Using an Effective Sequence-Based Combined Method



I. INTRODUCTION
Post-translational modifications (PTMs) can increase the diversity of protein functions to maintain physiological homeostasis [1]. As one of the critical PTMs, protein citrullination, illustrated in Figure 1, is a hydrolytic reaction that converts positively charged arginine into neutrally charged citrulline [2]. Mediated by the calcium-dependent peptidyl arginine deiminases (PADs) [3], citrullination can alter total charge and hydrogen bonding, with consequent effects on the target protein's molecular conformation, biochemical activities, immunogenicity, and interactions with proteins or nucleic acids [4].
The existing PAD isozymes (PAD1-4 and PAD6) exhibit tissue-specific expression [5]. Under normal circumstances, PAD1 and PAD3 are mainly expressed in the skin and hair follicles, where they participate in the terminal differentiation of keratinocytes by catalyzing the citrullination of pro-filaggrin [6]. PAD2, which may function within the epidermal growth factor signaling pathway to regulate cell migration, is principally distributed in skeletal muscle, brain, pancreas and spleen [7]. PAD4, involved in gene expression and protein localization, is primarily detected in neutrophils and other myeloid-derived cells [8]. PAD6, closely related to embryo development, is largely located in oocytes and embryonic stem cells [9]. As listed in Table 1, the PAD isozymes also show specificity for their targeted substrates.
TABLE 1. Known substrates that are targeted by the individual isozymes of the PAD family [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Liu.
At physiological concentrations of calcium, protein citrullination is crucial in a diverse array of cellular processes including proliferation, differentiation, apoptosis, myelinization, neutrophil extracellular trap (NET) formation, gene expression regulation, and skin homeostasis [5], [11]-[13]. For instance, citrullinated keratin and histones have important effects on skin protection and gene regulation [14]; citrullinated fibronectin can regulate the function of synovial fibroblasts [15]; citrullinated vimentin and its antibody can induce osteoclast differentiation and subsequent bone resorption [16]; citrullinated calreticulin on the cell surface can enhance its role in signaling pathways [17]. Although T. Goulas et al. shed light on a general regulatory mechanism of citrullination [18], the exact factors that lead to citrullination in vivo remain largely elusive [13]. To illuminate the reaction details of citrullination, development of novel strategies to comprehensively identify protein citrullination sites (PCSs) is urgently needed.
Recently, accumulating evidence has indicated that dysregulation of PADs in citrullination is involved in a range of human pathologies [10], [13], [19]-[21]. As reported, abnormally elevated protein citrullination followed by the production of anti-citrullinated protein antibodies (ACPAs) was detected in patients with rheumatoid arthritis (RA) [22]. Citrullinated histone H3, a biomarker of NET formation, is independently associated with the occurrence of venous thromboembolism in cancer patients [23]. PAD1-mediated citrullination is positively correlated with human triple negative breast cancer [24]. Overexpression of PAD2 has been implicated in the onset and progression of human malignant cancers [12], [20], [25], [26], whereas downregulation of PAD2 is observed in the pathogenesis of colorectal cancer [27]. Additionally, PAD4 has been linked to a wide range of inflammatory autoimmune diseases including arthritis, colitis and multiple sclerosis [28], [29]. Given the strong evidence linking dysregulated citrullination to human diseases, autoantibodies targeting citrullinated proteins have been used as promising diagnostic markers [5], [12], [28]. However, the pathological roles that PAD-mediated citrullination plays in these diseases are still to be discerned [10]. Against this background, accurate identification of PCSs is required to broaden our knowledge of PADs' substrate specificity and clarify the critical effects of citrullination on substrate function, which will ultimately have diagnostic or prognostic value in diverse citrullination-related diseases [30].
At present, a series of experimental methods have been developed to detect PCSs [31]. S.M. Hensen et al. proposed a robust and sensitive antibody-independent strategy to visualize modified citrullines through western blot analysis [32]. Exploiting the ionization characteristics of citrulline residues, detection of citrulline by mass spectrometry (MS) is a widely adopted technique. However, the abundance of citrullinated peptides is often too low to produce high-quality MS/MS fingerprints, and the corresponding non-citrullinated fragments are easily lost [33]. Furthermore, as citrullination results in a mass shift of only 1 Da, the ion signals of a citrullinated peptide are always difficult to detect in an MS spectrum [34]. Therefore, only a handful of PAD substrates are known, owing to the technical challenges associated with experimental methods [22].
With the advances in sequencing technologies, cost-effective computational methods have been proposed to accelerate the discovery of PCSs. By incorporating multiple kinds of sequence information such as amino acid composition, position-specific scoring matrix (PSSM) conservation scores, amino acid factors and disorder scores, Q. Zhang et al. employed a random forest classifier together with the minimum redundancy maximum relevance (mRMR) incremental feature selection (IFS) method to predict PCSs [35]. However, the sensitivity achieved by the predictor is as low as 0.603 due to the unsolved class imbalance problem in the dataset. Stimulated by the pseudo amino acid composition (PseAAC) approach [36], a sequence-based predictor called CKSAAP_CitrSite was proposed to improve the prediction performance by coupling a support vector machine (SVM) with the composition of k-spaced amino acid pairs (CKSAAP) selected by F-score [37]. Likewise, CKSAAP_CitrSite did not give a solution to the class imbalance problem. In addition, as its feature extraction strategy is based on a single technique, the intrinsic biological properties of protein citrullination are not fully considered, which may limit the prediction performance of CKSAAP_CitrSite.
The aforementioned methods have made certain contributions to stimulating the development of PCS detection, but there is still room for improvement, particularly in terms of sensitivity. In view of the limitations of the above-mentioned methods, this study proposes a novel and powerful method named PCSPred_SC for identifying PCSs using an effective sequence-based combined method. Firstly, different feature extraction methods including binary encoding (BE), position specific amino acid propensity (PSAAP), pseudo amino acid composition (PseAAC) [36] and physicochemical properties (PP) [38], [39] are adopted to convert peptides into numeric feature vectors. Secondly, under the complete feature space, PCS predictors are respectively constructed by various prediction algorithms, including naïve Bayes (NB), logistic regression (LR), artificial neural network (ANN), decision tree (DT), random forest (RF), and support vector machine (SVM). Thirdly, the effects of over-sampling methods including random over-sampling (ROS), synthetic minority over-sampling technique (SMOTE) [40], Border-line SMOTE [41], SVM-SMOTE [42] and Adasyn [43] are systematically explored using the top 2 prediction algorithms that achieve the best performance. Finally, to determine the best prediction model, different feature selection methods including mutual information (MI) [44], autoencoder (AE) [45], and t-distributed stochastic neighbor embedding (t-SNE) [46] are respectively incorporated into the top 2 models constructed by combining the prediction algorithm and over-sampling methods. Experimental results demonstrate that the proposed method achieves superior performance compared with existing methods in terms of various performance measures. A summary of the computational framework of our method is displayed in Figure 2.

II. MATERIALS AND METHODS
A. DATASET
To make fair comparisons with previous studies, we use the same training dataset introduced by Q. Zhang et al. [35]. The dataset was generated by scanning each protein sequence with a window of size 21 centered at citrullination or non-citrullination sites. If fewer than 10 upstream or downstream residues flanked the central site, the missing positions were filled with a dummy residue 'X'. As a result, the dataset includes 116 experimentally annotated PCSs and 232 non-citrullination sites.
To further reliably estimate the predictive ability of the proposed method, we construct an independent test dataset as follows. The citrullinated proteins are collected from the Universal Protein Resource (UniProt, available at https://www.uniprot.org/) by searching the keyword 'citrulline' in the 'Modified residue' field. The peptides within a window of size 21 centered at each citrullination site are then extracted; if the number of flanking amino acids is less than 10, the missing positions are padded with the special residue 'X'. Among all the retrieved sequences, only experimentally identified and reviewed citrullination sites are kept. Furthermore, all duplicate samples and the samples included in the training dataset are removed. As a result, a positive dataset including 138 samples with citrullination sites is obtained. We then randomly select 150 non-citrullination sites from the citrullinated proteins to construct a representative negative dataset, to which the same strict filtering criteria are applied. Thus, the independent test dataset has a total of 138 + 150 = 288 peptide samples.
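The window-extraction step described above can be sketched in a few lines; the helper name `extract_window` and the toy sequence are ours, not from the paper:

```python
def extract_window(seq, center, half=10, pad='X'):
    """Extract a (2*half + 1)-residue window centered at `center` (0-based),
    padding with 'X' wherever the protein terminus is closer than `half`."""
    left = seq[max(0, center - half):center]
    right = seq[center + 1:center + 1 + half]
    left = pad * (half - len(left)) + left      # pad truncated N-terminal side
    right = right + pad * (half - len(right))   # pad truncated C-terminal side
    return left + seq[center] + right

# toy sequence; position 2 is an arginine near the N-terminus
pep = extract_window("MARNDCEQGH", 2)
```

The central residue always lands at index 10 of the 21-residue window, which is what the position-specific descriptors below rely on.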

B. FEATURE EXTRACTION
For constructing a robust and reliable predictor, it is a crucial step to transform the input sequence into a set of numerical attributes that genuinely reflect its intrinsic correlation with the desired target [47]. To avoid the bias of using a single descriptor, integrating complementary information from different types of protein feature representations has become a new trend in feature design [48], [49]. In this study, we explore four types of quantitative feature descriptors, including binary encoding (BE), position specific amino acid propensity (PSAAP), pseudo amino acid composition (PseAAC), and physicochemical properties (PP). The detailed feature extraction processes are explained in the following subsections.

1) BINARY ENCODING
The 20 amino acids plus the aforementioned gap-filling residue 'X' are ordered alphabetically as ACDEFGHIKLMNPQRSTVWYX, with j = 1, 2, · · · , 21 indexing the different residue types in this ordering. The residue at each position of a peptide is encoded as a 21-dimensional binary vector {a_1, a_2, · · · , a_21}, where a_j = 1 for the residue's own index j and a_i = 0 for all i ≠ j.
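A minimal illustration of this one-hot encoding (the function name is ours):

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids + gap residue 'X'

def binary_encode(peptide):
    """Encode each residue as a 21-dim one-hot vector and concatenate,
    yielding a 21 * len(peptide) binary feature vector."""
    vec = []
    for r in peptide:
        onehot = [0] * len(ALPHABET)
        onehot[ALPHABET.index(r)] = 1
        vec.extend(onehot)
    return vec

v = binary_encode("AR")  # 'A' is index 0, 'R' is index 14
```

For the full 21-residue windows used in the paper this yields a 21 x 21 = 441-dimensional binary vector per peptide.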

2) POSITION SPECIFIC AMINO ACID PROPENSITY
The position specific amino acid propensity (PSAAP) is employed to measure the amino acid preferences at different positions flanking the known PCSs. Given a peptide P in the dataset, its most straightforward expression is P = R_1 R_2 · · · R_21 (1), where R_i represents the i-th residue of the peptide P. The detailed procedure of PSAAP is as follows. Firstly, the amino acid compositions of the j-th position for the positive dataset and the negative dataset are respectively calculated and denoted as A+_{i,j} and A-_{i,j}, i.e., the frequencies of the i-th amino acid at the j-th position in the positive and negative peptides. Then, a score z_{i,j} = A+_{i,j} - A-_{i,j} is computed to indicate the propensity of the i-th amino acid at the j-th position of the peptides centered at PCSs. Finally, a 21-dimensional vector for every peptide can be easily read out from the PSAAP matrix Z = (z_{i,j}), where the vector's i-th element is µ_i = z_{R_i, i}, the score of the residue actually observed at position i.
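The PSAAP procedure above can be sketched as follows; the two-position toy peptides are purely illustrative, and the function names are ours:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"

def psaap_matrix(pos_peps, neg_peps, width=21):
    """Z[i, j] = frequency of amino acid i at position j in the positives
    minus the same frequency in the negatives."""
    def freq(peps):
        A = np.zeros((len(ALPHABET), width))
        for p in peps:
            for j, r in enumerate(p):
                A[ALPHABET.index(r), j] += 1
        return A / max(len(peps), 1)
    return freq(pos_peps) - freq(neg_peps)

def psaap_features(pep, Z):
    # read off the propensity of the observed residue at each position
    return [Z[ALPHABET.index(r), j] for j, r in enumerate(pep)]

Z = psaap_matrix(["AR", "AR"], ["RR", "AR"], width=2)
f = psaap_features("AR", Z)
```

Each peptide is thus reduced to one propensity score per window position.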

3) PSEUDO AMINO ACID COMPOSITION
To avoid losing sequence order information hidden in protein sequences, the pseudo amino acid composition (PseAAC) proposed by KC Chou [36] is introduced to comprehensively incorporate the occurrences and physicochemical properties of amino acids. Ever since then, the concept of PseAAC has been penetrated into various areas of computational proteomics [47], [50], [51].
Considering the peptide given in Equation (1), the sequence-order correlation factor is defined as θ = (1/(L-1)) Σ_{i=1}^{L-1} [M(R_i) - M(R_{i+1})]², where L = 21 is the peptide length and M(R_i) indicates the normalized side-chain mass of the amino acid R_i, obtained through a standard conversion: M(i) = [M⁰(i) - mean(M⁰)] / std(M⁰), where M⁰(i) is the original side-chain mass of the i-th amino acid in alphabetical order. Then, the peptide given in Equation (1) is represented as a (20 + 1)-dimensional vector X = (x_1, · · · , x_20, x_21), whose components are given by x_u = f_u / (Σ_{v=1}^{20} f_v + ωθ) for u = 1, 2, · · · , 20 and x_21 = ωθ / (Σ_{v=1}^{20} f_v + ωθ), where f_u is the normalized occurrence frequency of the u-th of the 20 amino acids in the peptide sequence P. Without loss of generality, the weight factor ω is set to 0.05. In this representation, the first 20 descriptors depict the basic amino acid composition and the last descriptor reflects sequence-order information.
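A minimal sketch of this PseAAC variant, assuming a single correlation tier (λ = 1) and using a made-up four-letter alphabet with illustrative side-chain masses; the real descriptor covers all 20 amino acids:

```python
import numpy as np

# illustrative side-chain masses (Da) for a few residues only
MASS = {'A': 15.0, 'G': 1.0, 'K': 72.1, 'R': 100.1}

def pseaac(pep, w=0.05):
    aas = sorted(MASS)                                # stand-in alphabet
    m = np.array([MASS[a] for a in aas])
    norm = dict(zip(aas, (m - m.mean()) / m.std()))   # standardized masses
    L = len(pep)
    # sequence-order correlation factor over adjacent residues
    theta = sum((norm[pep[i]] - norm[pep[i + 1]]) ** 2
                for i in range(L - 1)) / (L - 1)
    f = np.array([pep.count(a) / L for a in aas])     # composition frequencies
    denom = f.sum() + w * theta
    return np.append(f / denom, w * theta / denom)    # composition + 1 descriptor

x = pseaac("ARKG")
```

By construction the descriptor vector sums to 1, with the last component carrying the sequence-order contribution weighted by ω.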

4) PHYSICOCHEMICAL PROPERTIES
Several studies have indicated that the physicochemical properties (PP) of a residue determine its interactions with the others [38], [39]. In this study, 13 physicochemical properties closely related to the behavior of protein interfaces, including positive charge, negative charge, neutral charge, polarity, non-polarity, hydrophobicity, hydrophilicity, secondary structure (helix), secondary structure (strand), secondary structure (coil), solvent accessibility (buried), solvent accessibility (exposed), and solvent accessibility (intermediate), are extracted from the web server Pfeature (https://webs.iiitd.edu.in/raghava/pfeature/). Then, for each peptide sample, the average value of each physicochemical property over its residues is calculated.
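A toy sketch of this averaging step; the property table below is a stand-in with invented binary values for two properties, not the actual Pfeature data:

```python
# hypothetical property table: two of the 13 properties, a few residues only
PROPS = {
    'A': {'hydrophobic': 1, 'polar': 0},
    'R': {'hydrophobic': 0, 'polar': 1},
    'S': {'hydrophobic': 0, 'polar': 1},
    'X': {'hydrophobic': 0, 'polar': 0},   # gap residue contributes zeros
}

def pp_features(pep, names=('hydrophobic', 'polar')):
    # average each physicochemical property over the peptide's residues
    return [sum(PROPS[r][n] for r in pep) / len(pep) for n in names]

f = pp_features("ARSX")
```

With the full table this yields one 13-dimensional vector per peptide sample.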

C. OVER-SAMPLING METHODS
As indicated by the dataset in Section II.A, the number of peptide chains without PCSs is twice that of peptide chains with PCSs. In other words, a class imbalance problem exists in the benchmark dataset, which would lead traditional machine learning algorithms to label most incoming data as the majority class [59]. In this study, over-sampling methods including random over-sampling (ROS), synthetic minority over-sampling technique (SMOTE), Border-line SMOTE, SVM-SMOTE, and Adasyn are respectively employed to balance the positive and negative training samples. The ROS method replicates randomly selected samples within the minority set; the SMOTE method generates novel synthetic samples by interpolating between each minority class sample and one of its k nearest minority class neighbors [40]; the Border-line SMOTE method over-samples only the borderline minority samples, identified by partitioning the minority set into noise, danger, and safe samples [41]; the SVM-SMOTE method generates artificial minority instances near the boundary between the majority and minority classes, which is approximated with a trained SVM [42]; the Adasyn method generates more synthetic samples for the minority class samples that are harder to learn [43].
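The interpolation idea behind SMOTE can be sketched in a few lines of NumPy; this is a simplified illustration of [40], not the implementation used in the paper (in practice one would use a library such as imbalanced-learn):

```python
import numpy as np

def smote(X_min, n_new, k=2, seed=0):
    """Minimal SMOTE sketch: each synthetic sample is a random interpolation
    between a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, float)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]        # skip the sample itself
        j = rng.choice(nbrs)
        lam = rng.random()                   # interpolation weight in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)

minority = [[0, 0], [1, 0], [0, 1]]
new = smote(minority, n_new=4)
```

Because every synthetic point lies on a segment between two minority samples, the new points stay inside the minority region rather than duplicating existing samples as ROS does.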

D. FEATURE SELECTION
Evidently, there always exist noisy, irrelevant, and redundant features in the integrated feature space, which can potentially cause the curse of dimensionality, overfitting, and increased computational complexity [60]. That is to say, not all of the candidate features facilitate the prediction of PCSs. Therefore, mutual information, autoencoder, and t-distributed stochastic neighbor embedding, described in detail below, are respectively employed to select the informative features.

1) MUTUAL INFORMATION
In a nonlinear context, the mutual information (MI) is widely used as the criterion to measure the amount of information shared between different variables [44]. Suppose the set of the values of the i-th feature F_i and the set of the class labels are respectively denoted as V_i and C; the MI of V_i and C is defined as I(V_i; C) = Σ_{v∈V_i} Σ_{c∈C} p(v, c) log[ p(v, c) / (p(v) p(c)) ]. From the perspective of information gain, MI represents the amount by which the uncertainty of C is reduced due to the introduction of F_i. A greater MI means that the feature F_i is more beneficial for distinguishing the elements in C.
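For discrete features the definition above can be computed directly; this sketch uses our own helper name:

```python
import numpy as np
from collections import Counter

def mutual_information(v, c):
    """I(V; C) = sum over (v, c) of p(v,c) * log(p(v,c) / (p(v) p(c))), in nats."""
    n = len(v)
    pv, pc, pvc = Counter(v), Counter(c), Counter(zip(v, c))
    return sum((nvc / n) * np.log((nvc / n) / ((pv[x] / n) * (pc[y] / n)))
               for (x, y), nvc in pvc.items())

# a binary feature that perfectly separates two classes carries I = log 2 nats
mi = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
```

An irrelevant feature (statistically independent of the labels) scores exactly 0, which is what makes MI usable as a filter-style ranking criterion.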

2) AUTOENCODER
Trained with unsupervised learning, the autoencoder (AE) is a derivative of ANNs that reconstructs the input data at its output layer [45]. If the number of neurons in the hidden layer is smaller than that of the input layer, dimensionality reduction of the original input patterns can be achieved by taking the hidden-layer activations as features. The AE learns the optimal weights connecting neurons through the backpropagation algorithm.
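One way to realize such an AE is to train a one-hidden-layer network to reproduce its input and read off the hidden activations; this sketch uses scikit-learn's MLPRegressor as a stand-in for the paper's (unspecified) implementation, on random stand-in data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 16))                      # stand-in feature matrix

# train the network to reproduce its input through a narrow hidden layer
ae = MLPRegressor(hidden_layer_sizes=(4,), activation='relu',
                  max_iter=2000, random_state=0)
ae.fit(X, X)

# the reduced representation is the hidden-layer activation (ReLU applied
# manually to the first-layer affine map)
H = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])
```

Here 16-dimensional inputs are compressed to 4 hidden features; in the paper, the analogous bottleneck width is the feature dimension being tuned.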

3) t-Distributed STOCHASTIC NEIGHBOR EMBEDDING
By matching pairwise similarities between the high-dimensional and low-dimensional spaces, t-distributed stochastic neighbor embedding (t-SNE) is a dimensionality reduction algorithm that retains the original clustering structure [46]. The whole procedure of t-SNE is given in the following steps. (i) Calculate ''unscaled'' similarity scores between the high-dimensional points using a Gaussian kernel and then normalize them.
(ii) Construct the similarity matrix with each element representing a similarity score. (iii) Create an initial set of low-dimensional points, whose pairwise similarities are measured with a Student t-distribution. (iv) Iteratively update the low-dimensional points to minimize the Kullback-Leibler divergence between the two similarity distributions.
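In practice these steps are wrapped by library implementations; a usage sketch with scikit-learn, reducing synthetic 50-dimensional toy data to 2 dimensions for illustration (the paper's final model uses a 3-dimensional embedding):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# two well-separated blobs in 50 dimensions
X = np.vstack([rng.normal(0, 1, (30, 50)),
               rng.normal(8, 1, (30, 50))])

# perplexity controls the effective neighbourhood size in step (i)
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
```

Note that t-SNE learns a mapping for the given points only; to embed unseen test samples, they must be transformed together with the training data or via an auxiliary mapping.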

E. PERFORMANCE MEASURES
To evaluate the prediction performance of PCS predictors, the widely used performance measures including sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthew's correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) are calculated. The first 4 performance measures are defined as follows: Sn = TP/(TP + FN), Sp = TN/(TN + FP), Acc = (TP + TN)/(TP + FP + TN + FN), and MCC = (TP × TN - FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)), where TP, FP, TN and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. On an imbalanced dataset, classifiers tend to favor the majority class, yielding a low Sn and a high Sp, and a high Sp often entails a high Acc [38]. Therefore, Acc alone is not an appropriate measure for performance evaluation. To achieve a comprehensive and stable assessment, the MCC, which reflects a trade-off between Sn and Sp, is employed as the main measure to construct the PCS predictor and compare it with existing methods.
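The four formulas can be checked with a small worked example; the confusion-matrix counts below are invented for illustration:

```python
import math

def metrics(tp, fp, tn, fn):
    """Sn, Sp, Acc and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc

sn, sp, acc, mcc = metrics(tp=50, fp=10, tn=90, fn=8)
```

This example also illustrates the point made above: with twice as many negatives as positives, Acc (about 0.89) looks flattering while MCC (about 0.76) gives a more balanced picture.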
To further evaluate the performance of our method, the receiver operating characteristic (ROC) curve is plotted with the true positive rate (i.e., Sn) as a function of the false positive rate (i.e., 1 - Sp) for varying decision thresholds [61]. The AUC, a reliable measure of prediction performance, is also calculated. 10-fold cross validation [62] is adopted in this study to calculate the above-mentioned performance indices. That is, the benchmark dataset is randomly partitioned into 10 subsets of approximately equal size. One subset is retained for testing and the others form the training dataset. This process is repeated 10 times so that each subset is tested once. Finally, the performance measures averaged over the 10 folds are reported as the final evaluation result.
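The 10-fold protocol can be sketched as follows, on synthetic stand-in data with an SVM classifier (stratified splitting preserves the class ratio in each fold, which matters for imbalanced data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 8))
y = np.array([0] * 50 + [1] * 50)
X[y == 1] += 1.0                       # shift one class so it is separable

accs = []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    model = SVC(kernel='rbf').fit(X[train_idx], y[train_idx])
    accs.append(model.score(X[test_idx], y[test_idx]))
mean_acc = float(np.mean(accs))        # fold-averaged accuracy
```

In the paper the same loop is run with the full feature pipeline, and Sn, Sp, Acc, MCC and AUC are averaged across the 10 folds instead of accuracy alone.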

III. RESULTS AND DISCUSSION
A. COMPARISON OF DIFFERENT PREDICTION ALGORITHMS
The classification performance is data sensitive and algorithm dependent. Hence, the effect of the different prediction algorithms introduced above on identifying PCSs is examined using the complete feature space without over-sampling. In these experiments, we tune the parameters of each algorithm under 10-fold cross validation. As listed in Table 2, the Acc achieved by RF is 0.805, which is 0.009-0.225 higher than those achieved by the other five algorithms. The closest competitor of RF in terms of Acc and MCC is SVM. SVM obtains the largest Sp, 0.987, which is 0.461, 0.146, 0.134, 0.203, and 0.034 higher than that obtained by NB, LR, ANN, DT and RF, respectively. Among these algorithms, RF achieves the best MCC of 0.542, and SVM achieves the second best MCC of 0.534. These results indicate that SVM and RF attain the most outstanding performance for PCS prediction. It is also worth noting that most of the algorithms yield much higher Sp than Sn due to the imbalanced dataset. Therefore, SVM and RF are selected as the prediction algorithms for the subsequent experiments in which the dataset is balanced with over-sampling methods.

B. THE CHOICE OF OVER-SAMPLING METHODS
In previous experiments, the PCS predictors were constructed on the imbalanced dataset given in Section II.A. To alleviate the class imbalance problem, different over-sampling methods are adopted to balance the dataset. The results obtained by SVM and RF combined with each over-sampling method are reported in Table 3.

C. ADDED VALUE OF OVER-SAMPLING METHODS
To provide insights into the added value of over-sampling methods, the prediction results without and with over-sampling methods, given in Table 2 and Table 3 respectively, are compared. Obviously, no matter which prediction algorithm is used, the predictors with over-sampling methods perform significantly better than the variants without them. As listed in Table 3, all 5 MCCs achieved by SVM combined with over-sampling methods are higher than 0.69 and 4 of them are higher than 0.76, while the MCC achieved by SVM without over-sampling is only 0.534. Similar comparison results can be obtained for RF. In addition, the Sns achieved by RF and SVM without over-sampling methods are less than 0.52, with a relatively large gap between Sn and Sp. On the contrary, the Sns achieved by RF and SVM with over-sampling methods are higher than 0.62, while keeping comparable Sp and Acc. These results highlight the incremental value of the over-sampling methods in enhancing the PCS predictors' reliability and performance. To further validate the effectiveness of the over-sampling methods, the statistical difference between the actual peptide chains and the generated peptide chains on the complete feature space is assessed by the paired t-test with α = 0.05. As shown in Figure 3, for the majority of features, there is no significant difference between the actual peptide chains and the generated peptide chains.
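The paired t-test assessment can be sketched as follows; the feature values are synthetic and constructed so that the two groups do not differ on average, mimicking the "no significant difference" outcome reported for most features:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
real = rng.normal(0.5, 0.1, 50)                 # a feature on actual peptides
# toy "synthetic" values whose deviations from the real values cancel out
synthetic = real + np.tile([0.01, -0.01], 25)

t_stat, p_value = ttest_rel(real, synthetic)    # paired test on one feature
significant = p_value < 0.05                    # alpha = 0.05, as in the paper
```

In the paper this test is run once per feature of the complete feature space, comparing actual versus over-sampling-generated peptide chains.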

D. PERFORMANCE COMPARISONS OF FEATURE SELECTION METHODS
The feature selection methods employed in this study can be categorized into a filter algorithm (MI) and projection algorithms (AE and t-SNE). For the filter algorithm, features are ranked according to their MI weights; then, to select the optimal feature subset, the MCCs corresponding to varying numbers of top-ranking features are calculated. For the projection algorithms, the prediction results of the feature spaces mapped by AE or t-SNE to different dimensions are evaluated to determine the optimal dimension of the feature vector. Taking SVM as the prediction algorithm, Figure 4 illustrates the relations between the MCCs and the feature subsets selected by different feature selection methods with Adasyn or SVM-SMOTE. As the curves in Figure 4 show, an increasing number of features does not guarantee better prediction performance for any feature selection method, as additional features are more likely to be correlated or redundant. Table 4 provides the prediction performance of the models built with the optimal feature set for each feature selection method. The feature dimensions of the optimal feature sets in Table 4 are the values of the x-coordinate at which the corresponding curves in Figure 4 reach their maximums. As shown in Figure 4, the model trained with the combination of AE and Adasyn yields the highest MCC at a feature dimension of 90. As given in Table 4, the model trained with the combination of t-SNE and Adasyn yields the highest Sn and AUC at a feature dimension of 3. According to the results in Figure 4, the candidate feature dimension is capped at 90. Given the comparable performance achieved by the different models, we select t-SNE + Adasyn with 3 features to significantly reduce the computational cost and the risk of overfitting. The 3 potentially important features incorporate the combined information of all the original features.
Therefore, we analyze the correlations between the features and protein citrullination sites. Recall that four types of quantitative feature descriptors are explored in this study: binary encoding (BE), position specific amino acid propensity (PSAAP), pseudo amino acid composition (PseAAC), and physicochemical properties (PP). The BE and PSAAP measure the amino acid preferences at different positions flanking the known citrullination sites; the PseAAC incorporates the order information hidden in protein sequences; the PP of residues are known to be important for protein interactions, as they are associated with protein folding, interior packing, and catalytic mechanisms. These features may provide some clues for uncovering the mechanisms of protein citrullination. Furthermore, the classification boundary between PCSs and non-PCSs in the feature space obtained by t-SNE + Adasyn is clearly visible in Figure 6. Therefore, the SVM, Adasyn, and t-SNE are respectively employed as the prediction algorithm, the over-sampling method, and the feature selection method to construct our final PCS predictor, PCSPred_SC.

E. EFFECTIVENESS OF THE FEATURE SELECTION METHODS
Feature selection is a crucial step for constructing a robust prediction model. To evaluate the effectiveness of the feature selection methods, the prediction performance on the original feature set without feature selection is compared to that on the optimal feature subset with feature selection. As listed in Table 3 and Table 4, SVM + Adasyn with MI is superior to SVM + Adasyn without MI, with Acc and MCC increasing from 0.934 and 0.850 to 0.937 and 0.858, respectively. Similar conclusions can be drawn for SVM + Adasyn with AE or t-SNE. Except that the Acc and MCC achieved by SVM + SVM-SMOTE with t-SNE are lower than those achieved by SVM + SVM-SMOTE without t-SNE, all the models with feature selection in Table 4 outperform the models without feature selection in Table 3. These results indicate that the feature selection methods adopted in this study are effective at removing irrelevant and redundant features from the original feature space.

F. PERFORMANCE COMPARISONS UNDER DIFFERENT VALIDATION METHODS
After the predictor is fully trained on the training set, independent testing is performed on the independent test set. In leave-one-out cross validation, each sequence in the dataset is in turn singled out as the test sample while the remaining samples train the predictor. The 10-fold cross validation, the leave-one-out cross validation and the independent dataset test are each conducted 20 times and the corresponding performance measures are averaged to reduce the variance of the estimates. The results in Table 5 show that the prediction performance under the three validation schemes is highly consistent, indicating the robustness and the excellent generalization ability of the proposed method.

G. PERFORMANCE COMPARISONS WITH EXISTING METHODS
To gain insights into the efficiency of the proposed PCSPred_SC, we compare it with the competing prediction methods, Q. Zhang et al.'s method [35] and CKSAAP_CitrSite [37], under 10-fold cross validation. For PCSPred_SC in Table 6, the performance measures are calculated from the prediction results of the samples in the original dataset only, excluding the samples generated by the over-sampling method. That is to say, the data used to compare the prediction performance of the proposed method with the other methods is unchanged, so the comparisons are relatively fair. As listed in Table 6, where the best results are highlighted in bold, all performance measures except Sp achieved by PCSPred_SC are superior to those of the competing prediction methods. Specifically, PCSPred_SC achieves the highest MCC, followed by CKSAAP_CitrSite with MCC = 0. Overall, PCSPred_SC significantly enhances the PCS prediction performance while remarkably reducing the number of features used for this task.
Several factors may account for the competitive performance of PCSPred_SC. Firstly, the feature extraction methods capture the characteristics of PCSs, yielding more discriminative power; secondly, the imbalanced dataset problem is solved by the over-sampling methods; thirdly, the feature selection methods are effective at removing irrelevant and redundant features; lastly, the combined method exploits the complementarity of the prediction algorithms, over-sampling methods, and feature selection methods.
Generally, overfitting occurs in the following 3 cases: (i) high-dimensional features containing noise; (ii) overtraining; (iii) insufficient training data. To reduce the influence of overfitting, we have adopted feature selection methods to map the high-dimensional feature space to a low-dimensional one while filtering out redundant and noisy information. In addition, traditional prediction algorithms are employed for classification with largely default parameters to prevent overtraining. Therefore, the limited training data adopted in this study and previous studies remains the main factor that may cause our model to overfit. Recent breakthroughs in proteomic techniques have resulted in rapid growth of newly discovered protein sequences. In future work, expanding the benchmark dataset for citrullination site prediction to mitigate overfitting will be an important research direction.

IV. CONCLUSIONS
In view of the significant roles of PCSs in numerous biological events and human diseases, a novel and powerful sequence-based PCS prediction method named PCSPred_SC is proposed with hybrid features integrating BE, PSAAP, PseAAC, and PP. Under the complete feature space, PCS predictors are respectively constructed by various prediction algorithms. To solve the imbalanced dataset problem, several over-sampling methods are systematically explored. To handle the irrelevant and redundant features in the feature space, feature selection methods are adopted to further enhance prediction performance. Experimental results indicate that the combination of SVM, Adasyn, and t-SNE attains the most outstanding performance for PCS prediction. When evaluated on the training dataset using 10-fold cross validation, PCSPred_SC achieves excellent performance with an Sn of 0.948, an Sp of 0.931, an Acc of 0.937, an MCC of 0.862 and an AUC of 0.997, far better than the competing methods. Furthermore, PCSPred_SC can significantly reduce the computational and space cost by employing just 3 features. In future work, a wider range of segment-based feature extraction methods will be integrated into PCSPred_SC to further improve the performance. Additionally, we will construct a deep learning-based framework to overcome the deficiencies of traditional hand-crafted features.