Prediction of Cyclin Protein Using Two-Step Feature Selection Technique

Cyclins are a family of proteins that regulate the cell cycle by activating cyclin-dependent kinases, a group of enzymes required in the cell cycle. Constructing a model to classify Cyclins is important for understanding their function. Because sequence similarity among Cyclins is low, a machine learning based model is urgently needed to identify them. In this study, a method based on the support vector machine (SVM) is developed to recognize Cyclins using only amino acid sequence information. 18 feature descriptors with a total of 13151 dimensions were extracted, and the feature dimension was reduced to 8 through a two-step feature selection technique. The reserved features show that some feature descriptors, such as Autocorrelation, AAC and CTDC, are important in the identification of Cyclins. Jackknife cross-validation results indicate our model can classify Cyclins with an accuracy of 91.9%, which is superior to a recent study using the same data set. Our work provides an important tool for discriminating Cyclins.


I. INTRODUCTION
Cyclins are a family of proteins that regulate the cell cycle by activating cyclin-dependent kinases, a group of enzymes required in the cell cycle [1]. At different stages of the cell cycle, the concentrations of most cyclins change to some degree. These alterations in concentration depend on two mechanisms. One is the fluctuation of cyclin gene expression. The other is the ubiquitin-mediated cyclin degradation pathway [2].
Cyclins can form a complex with cyclin-dependent kinases. Once the complex is phosphorylated, the cdk-active site is activated. The activated complexes in turn play a role in the cell cycle [2]. For example, the complex of Cdk1 (Cdc2a) and cyclin CycBlzm2 mediates mitosis and cytokinesis in root tip cells of corn [3]. By combining with cdk, cyclin can also generate maturation promoting factor (MPF). MPF triggers the phosphorylation of some proteins, which contributes to specific events of the cell cycle, such as the assembly of microtubules and chromosomal reorganization [4]. Cyclin in these complexes does not have an enzyme-active site, but has a substrate binding site, and can also locate cdks to a specific subcellular structure [2].
In the post-genomic era, biological data, especially sequence data, has exploded [5]-[8]. But the traditional experimental approach is not only time-consuming and expensive, it also has a relatively low rate of success [9]-[13]. Therefore, a fast method for identifying sequences is in urgent demand. Although existing calculation methods such as BLAST and FASTA can compare a new sequence with known proteins in common databases, these methods cannot distinguish whether it is a cyclin. Machine learning based classification is becoming mainstream in this area [14]-[25]. In previous work, StAR [26] and other classifiers using Pseudo Amino Acid Composition (PseAAC) could identify cyclins with an accuracy of 83.53%. Kalita et al. [16] set up a cyclin prediction server, CyclinPred, based on an SVM model, which offered several prediction strategies. The same data set was used in our work, and a different method improved the accuracy to 94.3%; the accuracy of Jackknife cross-validation was 91.9%. We extracted a total of 18 different feature descriptors and reduced the feature dimension to 8 after two rounds of screening. This method can serve as a reference for future related research.

II. MATERIALS AND METHODS

A. THE BENCHMARK DATASET
As mentioned above, the data set used in this study is the same as that of Mohabatkar's work [26], which includes 215 cyclins and 204 non-cyclin proteins. The training data set has a great impact on the trained model, and highly similar samples may cause overfitting of the proposed model [27]. Redundant sequences with more than 90% similarity were therefore removed using BLASTCLUST, leaving 167 cyclins and 167 non-cyclin proteins to train and test our proposed model. A benchmark dataset with low sequence identity, such as 25%, could produce more reliable results. However, this study did not use such a stringent criterion because the number of remaining samples would have been too small to be statistically significant.

B. FEATURE EXTRACTION STRATEGIES
Features are the key elements in machine learning-based protein classification [28]-[41]. Selecting reasonable features can effectively improve the accuracy of the classifier and the speed of model generation [42]-[45]. The following sections introduce the features utilized in this work.
1) SEQUENCE-ORDER-COUPLING NUMBER (SOCNUMBER)
The sequence-order-coupling number (SOCNumber) reflects the sequence-order effect of an amino acid sequence. The d-th rank coupling number is defined in Eq. (1):
$$\tau_d = \sum_{i=1}^{N-d} \left(d_{i,i+d}\right)^2, \quad d = 1, 2, \ldots, \text{nlag} \tag{1}$$
where $d_{i,i+d}$ is the distance between the two residues at positions $i$ and $i+d$, and $N$ is the sequence length. The distance between any two residues is taken from the Schneider-Wrede physicochemical distance matrix [46].
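For illustration, a minimal Python sketch of Eq. (1) follows; the nested-dict form of the Schneider-Wrede matrix and the default lag range are assumptions for illustration, not values taken from the paper.

```python
# Minimal sketch of Eq. (1). `dist` is assumed to be the Schneider-Wrede
# distance matrix supplied as a nested dict: dist['A']['C'] -> float.
def soc_numbers(seq: str, dist: dict, nlag: int = 30) -> list:
    """Return the coupling numbers [tau_1, ..., tau_nlag] of a sequence."""
    n = len(seq)
    # tau_d: sum of squared distances between residues i and i+d
    return [sum(dist[seq[i]][seq[i + d]] ** 2 for i in range(n - d))
            for d in range(1, nlag + 1)]
```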

2) QUASI-SEQUENCE-ORDER DESCRIPTOR (QSORDER)
Quasi-sequence-order descriptor (QSOrder) also reflects the effects of sequence order [47]. For each amino acid type $r$, the quasi-sequence-order is defined as $x_r$ in Eq. (2):
$$x_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w \sum_{d=1}^{\text{nlag}} \tau_d}, \; r = 1, \ldots, 20; \qquad x_r = \frac{w\,\tau_{r-20}}{\sum_{r=1}^{20} f_r + w \sum_{d=1}^{\text{nlag}} \tau_d}, \; r = 21, \ldots, 20+\text{nlag} \tag{2}$$
where $f_r$ is the normalized occurrence of amino acid type $r$, $\tau_d$ is the d-th rank sequence-order-coupling number, and $w$ is a weighting factor ($w = 0.1$).
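Building on the sketch above, Eq. (2) can be computed as follows; `soc_numbers` and the nlag default are the assumptions introduced in the previous sketch.

```python
from collections import Counter

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'

def qsorder(seq: str, dist: dict, nlag: int = 30, w: float = 0.1) -> list:
    """Quasi-sequence-order descriptor of Eq. (2)."""
    counts = Counter(seq)
    taus = soc_numbers(seq, dist, nlag)          # from the sketch above
    denom = sum(counts[a] for a in AMINO_ACIDS) + w * sum(taus)
    # First 20 entries reflect composition, the rest reflect sequence order
    return ([counts[a] / denom for a in AMINO_ACIDS] +
            [w * tau / denom for tau in taus])
```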

3) MORAN AUTOCORRELATION DESCRIPTOR
The Moran autocorrelation descriptor reflects the correlation of a physicochemical property between residues separated by a lag $d$ along the sequence [48]. It is defined in Eq. (3):
$$I(d) = \frac{\frac{1}{N-d}\sum_{i=1}^{N-d}\left(P_i - \bar{P}\right)\left(P_{i+d} - \bar{P}\right)}{\frac{1}{N}\sum_{i=1}^{N}\left(P_i - \bar{P}\right)^2} \tag{3}$$
In the equation, $P_i$ is the value of the i-th amino acid in a property entry of AAindex, and $\bar{P}$ is the mean property value over the sequence. BHAR880101 [49], CIDH920105 [50] and CHAM810101 [51] are property entries, whose meanings are average flexibility indices, normalized average hydrophobicity scales and steric parameter, respectively.
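A minimal sketch of Eq. (3) for a single property entry and lag; the property dictionary (amino acid to normalized AAindex value) is assumed to be prepared beforehand.

```python
def moran(seq: str, prop: dict, d: int) -> float:
    """Moran autocorrelation of Eq. (3) for one AAindex property entry."""
    n = len(seq)
    vals = [prop[a] for a in seq]
    mean = sum(vals) / n
    # Covariance between residues i and i+d, normalized by total variance
    num = sum((vals[i] - mean) * (vals[i + d] - mean)
              for i in range(n - d)) / (n - d)
    den = sum((v - mean) ** 2 for v in vals) / n
    return num / den
```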

4) COMPOSITION/TRANSITION/DISTRIBUTION (C/T/D)
Composition/transition/distribution (C/T/D) describes the global composition of an amino acid sequence, the frequencies of two different contiguous amino acids, and the distribution pattern of an attribute along the sequence. To calculate the CTD of a sequence, the first task is to encode the sequence [52]. The amino acids are divided into three classes according to an attribute [53], and each amino acid is encoded by one of the indices (1, 2, 3) according to the class it belongs to. Hydrophobicity and charge are two such classifications, which are shown in Table 1. CTD, including composition $c_r$, transition $t_{ij}$ and distribution $d_{i,m}$, is defined in Eq. (4):
$$c_r = \frac{N_r}{N}, \qquad t_{ij} = \frac{N_{ij} + N_{ji}}{N - 1}, \qquad d_{i,m} = \frac{N_{i,m}}{N} \tag{4}$$
where $N_r$ is the number of residues of class $r$, $N_{ij}$ is the number of contiguous i-class/j-class pairs, and $N_{i,m}$ is the position of the residue marking the m-th milestone (first, 25%, 50%, 75% and 100% occurrence) of the i-th class along the sequence.
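The following sketch computes C/T/D for one attribute, assuming the sequence has already been encoded into class labels 1/2/3 as described above; the milestone fractions follow the common convention and are an assumption where the original equation is not fully recoverable.

```python
import math

def ctd(encoded: list) -> dict:
    """C/T/D of Eq. (4); `encoded` is the sequence mapped to class
    labels 1/2/3 (e.g. by the hydrophobicity classes of Table 1)."""
    n = len(encoded)
    out = {}
    # Composition: fraction of residues in each class
    for r in (1, 2, 3):
        out[f'C{r}'] = encoded.count(r) / n
    # Transition: frequency of adjacent residues from two different classes
    for i, j in ((1, 2), (1, 3), (2, 3)):
        t = sum(1 for k in range(n - 1)
                if {encoded[k], encoded[k + 1]} == {i, j})
        out[f'T{i}{j}'] = t / (n - 1)
    # Distribution: relative sequence position of the first, 25%, 50%,
    # 75% and 100% occurrence of each class
    for r in (1, 2, 3):
        pos = [k + 1 for k, c in enumerate(encoded) if c == r]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            idx = max(1, math.ceil(frac * len(pos)))
            out[f'D{r}.{frac}'] = pos[idx - 1] / n if pos else 0.0
    return out
```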

5) AMINO ACID COMPOSITION (AAC)
Amino acid composition (AAC) has been widely applied in protein classification [54]-[57]. Here it is defined as the frequency of each of the 20 amino acids: $f(i) = N_i / N$, where $N_i$ is the occurrence of the i-th amino acid and $N$ is the sequence length.

6) COMPOSITION OF K-SPACED AMINO ACID PAIRS (CKSAAP)
The composition of k-spaced amino acid pairs (CKSAAP) reflects the short-range correlation of residues within an amino acid sequence. CKSAAP is defined by the frequency of each k-spaced residue pair $C_{XY}$, as illustrated in Eq. (5):
$$C_{XY} = \frac{N_{XY}}{N - k - 1} \tag{5}$$
where $X$ and $Y$ denote amino acid types, $N_{XY}$ is the occurrence of the k-spaced pair $XY$, and $N - k - 1$ is the total number of k-spaced pairs in a sequence of length $N$. Generally, k is taken as 0, 1, 2, 3, 4 or 5.
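A minimal sketch of Eq. (5) for a single value of k:

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'

def cksaap(seq: str, k: int) -> dict:
    """Frequencies of all 400 k-spaced residue pairs (Eq. 5)."""
    total = len(seq) - k - 1                     # number of k-spaced pairs
    counts = Counter(seq[i] + seq[i + k + 1] for i in range(total))
    return {x + y: counts[x + y] / total
            for x, y in product(AMINO_ACIDS, repeat=2)}
```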

7) DIPEPTIDE COMPOSITION (DPC) AND ITS DEVIATION
Dipeptide composition (DPC) encapsulates the global information of an amino acid sequence [58], [59]. DPC is defined as $D_c(i)$ in Eq. (6). Dipeptide deviation from expected mean (DDE) is used to compute the deviation of dipeptide frequencies [60] and is determined by three parameters, also given in Eq. (6):
$$D_c(i) = \frac{n_i}{N-1}, \qquad T_M(i) = \frac{C_{i1}}{C_N}\cdot\frac{C_{i2}}{C_N}, \qquad T_V(i) = \frac{T_M(i)\left(1 - T_M(i)\right)}{N-1}, \qquad DDE_i = \frac{D_c(i) - T_M(i)}{\sqrt{T_V(i)}} \tag{6}$$
where $T_M(i)$ is the theoretical mean based on the codon numbers of amino acids; $C_{i1}$ and $C_{i2}$ are, for a given dipeptide $i$, the numbers of codons encoding its first and second amino acid, and $C_N$ is the total number of coding codons; $D_c(i)$ represents the dipeptide composition, and $n_i$ is the occurrence of the i-th dipeptide; $T_V(i)$ represents the theoretical variance of dipeptide $i$; $N$ is the length of the amino acid sequence. DDE is computed as $DDE_i$.
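A sketch of Eq. (6); the codon counts come from the standard genetic code, with 61 coding codons in total.

```python
from itertools import product

# Number of codons per amino acid in the standard genetic code
CODONS = {'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2,
          'I': 3, 'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2,
          'R': 6, 'S': 6, 'T': 4, 'V': 4, 'W': 1, 'Y': 2}
C_N = sum(CODONS.values())  # 61 coding codons

def dde(seq: str) -> dict:
    """DDE values of Eq. (6) for all 400 dipeptides."""
    n_pairs = len(seq) - 1
    out = {}
    for a, b in product(CODONS, repeat=2):
        dc = sum(1 for i in range(n_pairs) if seq[i:i + 2] == a + b) / n_pairs
        tm = (CODONS[a] / C_N) * (CODONS[b] / C_N)   # theoretical mean
        tv = tm * (1 - tm) / n_pairs                  # theoretical variance
        out[a + b] = (dc - tm) / tv ** 0.5            # deviation
    return out
```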

8) TRIPEPTIDE COMPOSITION (TPC)
Tripeptide composition (TPC) reflects potentially significant starting points for the design of small-molecule biological modulators. TPC is defined as $f_i$ in Eq. (7):
$$f_i = \frac{N_i}{N - 2} \tag{7}$$
where $N_i$ is the occurrence of the i-th tripeptide and $N - 2$ is the total number of tripeptides in a sequence of length $N$.

9) GROUPED AMINO ACID COMPOSITION (GAAC) AND ITS EXTENSIONS
Grouped amino acid composition (GAAC) reflects physicochemical properties [61]. The 20 amino acids are divided into five groups by their physicochemical properties (Table 2). GAAC is computed as the composition of each group. The composition of k-spaced amino acid group pairs (CKSAAGP) combines CKSAAP with GAAC.
The grouped dipeptide composition (GDPC) combines DPC with GAAC. Similarly, the grouped tripeptide composition (GTPC) combines TPC with GAAC.
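A minimal sketch of GAAC; since Table 2 is not reproduced here, the five groups below follow iLearn's default grouping and should be treated as an assumption.

```python
# Assumed grouping (iLearn defaults), standing in for Table 2
GROUPS = {
    'aliphatic': 'GAVLMI',
    'aromatic': 'FYW',
    'positive': 'KRH',
    'negative': 'DE',
    'uncharged': 'STCPNQ',
}

def gaac(seq: str) -> dict:
    """Fraction of residues belonging to each physicochemical group."""
    n = len(seq)
    return {name: sum(seq.count(a) for a in members) / n
            for name, members in GROUPS.items()}
```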
10) NORMALIZED MOREAU-BROTO AUTOCORRELATION DESCRIPTOR (NMBROTO)
The normalized Moreau-Broto autocorrelation descriptor (NMBroto) measures the autocorrelation of a physicochemical property between residues separated by a lag $d$, as in Eq. (8):
$$AC(d) = \frac{1}{N-d}\sum_{i=1}^{N-d} P_i\, P_{i+d} \tag{8}$$
where $P_i$ is the value of the i-th amino acid in a property entry of AAindex.

11) GEARY AUTOCORRELATION DESCRIPTOR
The Geary autocorrelation descriptor measures the squared difference of property values between residues separated by a lag $d$, as in Eq. (9):
$$C(d) = \frac{\frac{1}{2(N-d)}\sum_{i=1}^{N-d}\left(P_i - P_{i+d}\right)^2}{\frac{1}{N-1}\sum_{i=1}^{N}\left(P_i - \bar{P}\right)^2} \tag{9}$$
where $P_i$ is the value of the i-th amino acid in a property entry of AAindex.

12) CONJOINT TRIAD
CTriad was proposed to describe the information of protein-protein interactions. It regards any three sequential amino acids as a unit and computes their properties [65]. To calculate CTriad, the amino acids first need to be categorized into 7 classes, so there are 7 × 7 × 7 = 343 different units. The occurrence of each unit is counted as $f_i$ (i = 1, 2, ..., 343). CTriad is then defined as $I(i)$ in Eq. (10):
$$I(i) = \frac{f_i - \min\{f_1, \ldots, f_{343}\}}{\max\{f_1, \ldots, f_{343}\}} \tag{10}$$

13) GENERAL PSEUDO-AMINO ACID COMPOSITION (PAAC)
General pseudo-amino acid composition (PAAC) reflects not only the occurrence frequency of each amino acid but also the sequence-order effect of an amino acid sequence [66]. PAAC consists of $X_c(i)$ and $X_{c,\lambda}(i)$ and is defined in Eq. (11):
$$X_c(i) = \frac{f_i}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, \; i = 1, \ldots, 20; \qquad X_{c,\lambda}(i) = \frac{\omega\,\theta_{i-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, \; i = 21, \ldots, 20+\lambda \tag{11}$$
where $\theta_j$ is a series of sequence-order-correlated factors, computed as the average squared difference of property values between residues separated by $j$ positions; $P_i$ is the property value of the i-th amino acid; $N_P$ is the number of properties; $f_i = N_i / N$, with $N_i$ the occurrence of the i-th amino acid; and $\omega$ is a weighting parameter, set to 0.05 here.
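A sketch of Eq. (11); the list of normalized property dictionaries (classically hydrophobicity, hydrophilicity and side-chain mass) and the λ default are assumptions for illustration.

```python
from collections import Counter

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'

def paac(seq: str, props: list, lam: int = 10, w: float = 0.05) -> list:
    """PAAC vector of Eq. (11); `props` is a list of dicts mapping each
    amino acid to a normalized property value."""
    n = len(seq)

    def theta(d):
        # Average squared property difference between residues i and i+d
        return sum(
            sum((p[seq[i]] - p[seq[i + d]]) ** 2 for p in props) / len(props)
            for i in range(n - d)
        ) / (n - d)

    thetas = [theta(d) for d in range(1, lam + 1)]
    counts = Counter(seq)
    freqs = [counts[a] / n for a in AMINO_ACIDS]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```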

14) AMPHIPHILIC PSEUDO AMINO ACID COMPOSITION (APAAC)
Amphiphilic pseudo amino acid composition (APAAC) focuses on the order of amino acids in a sequence [67]. Its computation differs only slightly from PAAC. APAAC, consisting of $P_c(i)$ and $P_{c,j}(i)$, is defined in Eq. (12):
$$P_c(i) = \frac{f_i}{\sum_{i=1}^{20} f_i + \omega \sum_{d=1}^{2\lambda} \tau_d}, \; i = 1, \ldots, 20; \qquad P_{c,j}(i) = \frac{\omega\,\tau_{i-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{d=1}^{2\lambda} \tau_d}, \; i = 21, \ldots, 20+2\lambda \tag{12}$$
where $\tau_d$ reflects sequence-order information, and $P_j(i)$ is the property value of the i-th amino acid in the j-th property. The other parameters are the same as in PAAC.
Different feature descriptors represent different aspects of the physicochemical properties of an amino acid sequence. In order to fully integrate these different properties, this study extracted a large set of features (Figure 1) for subsequent screening.
iLearn [68] is a Python-based tool that helps researchers perform feature extraction and subsequent screening. It is known for its full functionality and ease of use. In this work, iLearn was used to extract all of these features.

C. FEATURE SELECTION
After the features were extracted, we removed redundant and noisy features, which may have a great impact on model generation.
In theory, every combination of the features should be tested. But with high-dimension features, computing resources would be exhausted: supposing there were 100-dimension features, it would require $2^{100}$ optimizations and models. Therefore, it is necessary to adopt some strategy for selecting optimal combinations of features. In previous studies, researchers have proposed a variety of excellent feature screening methods, such as single-factor analysis [69], Maximum-Relevance-Maximum-Distance (MRMD) [70], and Minimum Redundancy Maximum Relevance (mRMR) [71]. In this work, we used the following two screening methods.

1) INCREMENTAL FEATURE SELECTION
We sorted features in descending order according to ANOVA results and used incremental feature selection (IFS). In this way, the feature dimension drops rapidly and the calculation is fast.
Analysis of variance (ANOVA) [72] was applied to rank features, not only due to its excellent characteristics but also its short calculation time. Given the samples of the two groups, Cyclin and non-Cyclin, the f-score of each feature can be calculated with Eq. (13):
$$F = \frac{\sum_{i=1}^{k} n_i \left(\bar{x}_i - \bar{x}\right)^2 / (k-1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2 / (n-k)} \tag{13}$$
In Eq. (13), $k$ is the number of groups (here $k = 2$), $n$ is the total number of samples, $n_i$ is the number of samples in the i-th group, $\bar{x}_i$ is the average value of the samples in the i-th group, $\bar{x}$ is the average value of all samples, and $x_{ij}$ is the value of the j-th sample in the i-th group. The features can then be ranked in descending order according to their f-scores.
The feature set starts with the feature with the maximum f-score; the parameters of the SVM are optimized in cross-validation, and the accuracy of this feature set with those parameters is calculated. Then the second feature in the ranked list is added to form a new feature set, whose accuracy is also calculated by SVM, and so on. Given 100-dimension features, only 100 models are generated in this way, which greatly reduces the calculation time required for feature selection. After the accuracy of each feature set was calculated, the feature set with the maximum accuracy, called SM, was obtained.
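The first screening step can be sketched as follows with scikit-learn, whose SVC wraps LIBSVM; the file names are placeholders, and for brevity the sketch fixes default SVM parameters instead of re-optimizing them for each feature set as the text describes.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = np.load('features.npy'), np.load('labels.npy')  # placeholder data

f_scores, _ = f_classif(X, y)            # ANOVA f-score of each feature
order = np.argsort(f_scores)[::-1]       # rank features in descending order

accs = []
for k in range(1, X.shape[1] + 1):       # one model per incremental set
    subset = X[:, order[:k]]             # top-k ranked features
    acc = cross_val_score(SVC(kernel='rbf'), subset, y, cv=5).mean()
    accs.append(acc)

best_k = int(np.argmax(accs)) + 1        # size of the feature set SM
```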

2) GREEDY ALGORITHM
A greedy algorithm [73] was used to further optimize the feature set.
The main process of the greedy algorithm is as follows. For 100-dimension features, the model is first trained using only one feature; after 100 training sessions, the feature with the highest accuracy is selected. Then another feature is added to the previously selected ones, and the model is trained with two features; 99 comparisons are performed to find the feature set with the highest accuracy. This step is repeated until the accuracy of the new feature set is lower than that of the previous one.
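A sketch of this procedure as sequential forward selection; the stopping rule mirrors the description above, and the fold count and default SVM parameters are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def greedy_select(X, y, cv=5):
    """Add the single best feature each round; stop when accuracy drops."""
    remaining = list(range(X.shape[1]))
    chosen, best_acc = [], 0.0
    while remaining:
        # Score every candidate feature added to the current set
        scores = [(cross_val_score(SVC(kernel='rbf'),
                                   X[:, chosen + [j]], y, cv=cv).mean(), j)
                  for j in remaining]
        acc, j = max(scores)
        if acc <= best_acc:          # no candidate improves accuracy
            break
        chosen.append(j)
        remaining.remove(j)
        best_acc = acc
    return chosen, best_acc
```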

D. SUPPORT VECTOR MACHINE
In the machine learning area, the support vector machine (SVM) [74] is a supervised learning model that is widely applied in bioinformatics research, especially in data classification and regression analysis [20], [75]-[83]. In this paper, we used LIBSVM [84] for data classification. The radial basis function was selected as the kernel function, and grid search was used to optimize the kernel parameter γ and the regularization parameter C.
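A sketch of the parameter optimization; the grid ranges follow a common LIBSVM recommendation and are assumptions, not values from the paper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Exponential grids over C and gamma for the RBF kernel (assumed ranges)
param_grid = {'C': [2 ** p for p in range(-5, 16, 2)],
              'gamma': [2 ** p for p in range(-15, 4, 2)]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
# search.fit(X, y)  # X, y: the selected 8-dimension features and labels
```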

E. EVALUATION OF MODEL
The generated model needs to be tested to verify its performance. Among many verification methods, cross-validation is often utilized because of its objectivity [85]-[91]. Three methods are commonly used to validate models: independent verification, n-fold cross-validation, and Jackknife cross-validation. Jackknife cross-validation is considered the best here because it is suitable for small-sample issues and produces a unique result [92]. In Jackknife cross-validation, the sensitivity (Sn), specificity (Sp), and overall accuracy (Acc) were calculated as in Eq. (14):
$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP}, \qquad Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{14}$$
In Eq. (14), TP, TN, FP and FN represent the numbers of positive samples predicted as positive, negative samples predicted as negative, negative samples predicted as positive, and positive samples predicted as negative, respectively. We also used the receiver operating characteristic curve (ROC) and the area under the ROC (AUC) to display the model's performance.
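A sketch combining jackknife (leave-one-out) validation with the metrics of Eq. (14); labels are assumed to be coded 1 for Cyclin and 0 for non-Cyclin, and the SVM parameters would come from the grid search above.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVC

def jackknife_metrics(X, y, C=1.0, gamma='scale'):
    """Leave-one-out predictions, then Sn, Sp and Acc per Eq. (14)."""
    pred = cross_val_predict(SVC(kernel='rbf', C=C, gamma=gamma),
                             X, y, cv=LeaveOneOut())
    tp = np.sum((pred == 1) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return sn, sp, acc
```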

III. RESULTS AND DISCUSSION
In this work, a total of 18 different types of feature descriptors were extracted, constituting a 13151-dimensional feature vector.
Based on the f -score of each feature calculated by ANOVA, we sorted the features in descending order.
After the first-step screening, the accuracy of the trained model for each feature set is illustrated in Figure 2. When the feature set has 130 features, the 5-fold cross-validation accuracy reaches its highest value (95.2%). Although this accuracy is high, the feature dimension is still large.
Thus, the greedy algorithm was used to further reduce the feature dimension. After the second-step screening, the dimension of the optimal feature set dropped dramatically to eight. Using these features to train the model, we obtained an accuracy of 91.9% in jackknife cross-validation with an AUC of 0.9159. The accuracy of the new feature subset is slightly lower than that of the first-step feature selection. However, the feature dimension decreases from 130 to 8, which improves the reliability of the model and reduces the risk of overfitting. Therefore, we prefer the 8-dimension features for constructing the final model. In Mohabatkar's work, the feature set was not screened, and the model was constructed using all 21 discrete factors [26].
In a recent study [26], Mohabatkar et al. used the same training data set as this paper and achieved a training accuracy of 83.53%. The comparison in Table 3 shows that our proposed model outperforms the published one. Furthermore, we list the 8 optimal features in Table 4. These feature descriptors are described in MATERIALS AND METHODS; here we explain them briefly. Schneider-Wrede is a matrix of amino acid distances, and lag25 means that the SOCNumber of this reserved feature has a lag of 25. BHAR880101.lag22 is the Moran autocorrelation using BHAR880101 with d = 22, and CIDH920105.lag5 is read the same way. CHAM810101.lag7 is the NMBroto autocorrelation using CHAM810101 with d = 7. Hydrophobicity_PRAM900101.C1 means the composition of Class 1 classified by Hydrophobicity_PRAM900101, and Charge.C2 means the composition of Class 2 classified by Charge in C/T/D. Schneider-Wrede.Xr.L is the QSOrder of amino acid L using the Schneider-Wrede matrix.
It is with these 8-dimension features that we could train a model with good results; consequently, the biochemical and physicochemical properties represented by these features have functional relevance for cyclins. The second and seventh features reveal that amino acids G and L are relevant to the nature of cyclins. The fourth and eighth features indicate that neutral charge and neutral hydrophobicity are also relevant. Coincidentally, amino acids G and L have neutral hydrophobicity and neutral charge, respectively. Meanwhile, the autocorrelation of amino acids is also important for cyclins according to the other features.

IV. CONCLUSION
A method based on SVM was developed to classify Cyclins using amino acid sequences. Through feature extraction across 18 types of feature descriptors and a feature selection procedure combining ANOVA and a greedy algorithm, 8-dimension features were reserved. The SVM model constructed with these 8-dimension features performed better than a recent study. Jackknife cross-validation demonstrated that this method is helpful for classifying Cyclins.