The Feature Compression Algorithms for Identifying Cytokines Based on CNT Features

Cytokines are signaling proteins that regulate a wide range of biological functions, so it is important to distinguish them from other kinds of proteins. The 188-dimensional (188D) CNT features have been proposed to identify cytokines, but they contain many redundant features. In this paper, we propose three feature compression algorithms that exclude the redundant features from the 188D feature set while preserving the classification accuracy: the genetic based algorithm, the greedy based algorithm and the brute-force based algorithm. Experimental results demonstrate that the brute-force based algorithm achieves the highest classification accuracy among the three algorithms, and that the genetic based algorithm yields the smallest compressed feature set. However, both consume much more time than the greedy based algorithm. The greedy based algorithm makes a good trade-off among three factors: the classification accuracy, the number of compressed features and the time consumption.


I. INTRODUCTION
Cytokines are a type of protein that plays an important regulatory role in many cellular activities, such as differentiation, growth and interactions between cells. The study of cytokine identification and classification therefore has important theoretical and practical significance. The structures and functions of unknown types of cytokines can be understood by accurately recognizing cytokine sequences.
Based on the sequence structures and functions of cytokines obtained, the authors of [1] identify cytokines by manual prediction. Several methods have been proposed over the last decades to identify cytokines, such as the Hidden Markov Model (HMM) based methods [2], [3], the Artificial Neural Network (ANN) based methods [4]-[7], the Basic Local Alignment Search Tool (BLAST) [8], FASTA [9], [10], CTKPred [11] and CytoPred [12]. In [13], Cai et al. utilize a set of 188-dimensional features extracted from the amino acid (AA) composition to identify the cytokines. A common point of all the methods mentioned above is that they need to extract many features from the cytokines. Are all these features necessary? The more features an identification algorithm uses, the more computational resources it consumes. Sometimes, irrelevant features can even reduce the accuracy of the identification algorithm. It is therefore necessary to exclude the irrelevant features from the feature set.
The associate editor coordinating the review of this manuscript and approving it for publication was Quan Zou.
Feature selection is an effective method to reduce the number of features in the feature set for classification tasks [14]-[39], but it can be very difficult because of the large search space [40]. Given $n$ features, there are $2^n$ possible feature subsets [41]. As the number of features increases, the feature selection problem becomes even more challenging [42]-[45], and an exhaustive search for the optimal feature subset becomes infeasible [46], [47]. Various kinds of search strategies have been proposed, such as the complete, the random, the greedy and the heuristic search strategies [48]-[68].
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we compress the 188D CNT feature set [69] by removing its redundant features while keeping the identification accuracy of the method based on the original 188D CNT feature set. We propose three feature compression algorithms.
The first algorithm is the genetic based feature compression algorithm. In this algorithm, a 188D binary vector (called a solution) represents the 188D feature set, and each bit of the vector corresponds to one feature in the 188D CNT feature set. If a feature is selected for the compressed feature set, its bit is set to 1; otherwise, the bit is set to 0. A correlation-based algorithm is proposed to produce the initial population with n solutions. The population is then evolved for several generations by the genetic based algorithm. The classification accuracy is used as the fitness value to evaluate the quality of each solution in the population in each generation. Finally, the genetic based algorithm returns the binary vector with the largest fitness value among all solutions in the population. The final compressed feature set is composed of the features whose bits are set to 1 in that vector.
The second algorithm is the greedy based feature compression algorithm. All features in the 188D feature set can be divided into 9 classes, called feature classes: the quantities of the AAs (20D), hydrophobicity (21D), polarity (21D), normalized Van der Waals volume (21D), surface tension (21D), charge (21D), polarizability (21D), solvent accessibility (21D) and secondary structure (21D). In the greedy based algorithm, we evaluate the correlation between each feature class and the cytokine label.
The feature classes are greedily added to the compressed feature set in descending order of their evaluation results. After a class of features is added to the compressed feature set, the classification accuracy of the new compressed feature set is evaluated by the Support Vector Machine (SVM).
The third algorithm is the brute-force based feature compression algorithm, which is also based on the feature classes. A 9-bit binary vector represents the 9 feature classes of the 188D feature set, and each bit indicates whether a feature class is selected for the compressed feature set. The 9-bit binary vector can be read as a decimal number ranging from 1 to 511 (zero is excluded because it means that no feature class is selected). The brute-force based algorithm evaluates all 511 cases and tests their classification accuracy. Finally, the features in the feature classes corresponding to the decimal number with the highest accuracy are selected as the compressed feature set.
The contributions of the paper are as follows. First, we propose three algorithms to compress the 188D feature set used to identify cytokine proteins: the genetic based algorithm, the brute-force based algorithm and the greedy based algorithm. Second, extensive experiments were done to test and compare the performance of the three algorithms. The experimental results show that the genetic based algorithm achieves the best feature compression performance, although it consumes the most time and the classification accuracy of its compressed feature set is not better than that of the 188D feature set. The brute-force based algorithm has the best classification accuracy, but it is also time-consuming. The greedy based algorithm makes a good trade-off among the classification accuracy, the number of compressed features and the time consumption.
The paper is organized as follows. Section II introduces the data collection and data preprocessing method. Section III introduces the three feature compression algorithms in detail. Section IV gives the experimental results that evaluate the performance of the three algorithms proposed in this paper. Finally, Section V draws the conclusion.

A. DATA COLLECTION AND PREPROCESSING
Cytokines regulate a wide range of biological functions, including hematopoiesis, inflammation and repair, by extracellular signaling. It is important to distinguish the cytokines from other kinds of proteins. Cai et al. [13] extract 188-dimensional (188D) features based on the physicochemical properties, distribution and composition of amino acids, which are used to analyze whether a protein is a cytokine. However, it is unclear whether all 188 features are necessary for the identification. In this paper, we propose three feature compression algorithms to reduce the number of features in the 188D feature set used to predict whether a protein is a cytokine. Figure 1 shows the procedure used to collect and preprocess the data for the three feature compression algorithms. The whole data set is composed of two parts: the positive instances and the negative instances. To get the positive instances, we download the cytokine data set from the UniProt database [70]-[72]. To get the negative instances, we first list the PFAM families that the positive instances belong to. For each remaining PFAM family, i.e. each family that contains no positive instance, we extract the protein with the longest sequence as a negative instance. The CD-HIT program [73] is used to remove the redundant instances from the positive and negative data sets. Finally, we get a data set with 18944 instances altogether, which contains 9645 positive instances and 9299 negative instances.
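The negative-instance selection described above can be illustrated with a minimal Python sketch. It assumes the proteins are already available as in-memory records; the `Protein` record type, the family names and the toy sequences are hypothetical, and the UniProt download and the CD-HIT redundancy removal are not shown.

```python
from collections import namedtuple

# Hypothetical record type; real data would be parsed from UniProt and PFAM files.
Protein = namedtuple("Protein", ["pid", "family", "sequence"])


def pick_negatives(positives, candidates):
    """Return one negative instance per PFAM family that holds no positive.

    For every such family the protein with the longest sequence is kept,
    as described in the text. CD-HIT filtering is not part of this sketch.
    """
    positive_families = {p.family for p in positives}
    longest = {}
    for prot in candidates:
        if prot.family in positive_families:
            continue  # families that contain cytokines are excluded
        best = longest.get(prot.family)
        if best is None or len(prot.sequence) > len(best.sequence):
            longest[prot.family] = prot
    return sorted(longest.values(), key=lambda p: p.family)


# Toy data with made-up family names.
positives = [Protein("c1", "FAM_A", "MKT"), Protein("c2", "FAM_B", "MAAG")]
candidates = [
    Protein("a", "FAM_A", "MKKKKKK"),  # shares a family with a positive
    Protein("b", "FAM_C", "MKV"),
    Protein("c", "FAM_C", "MKVLAA"),   # longest protein in FAM_C
    Protein("d", "FAM_D", "MS"),
]
negatives = pick_negatives(positives, candidates)
```

Keeping a single dictionary entry per family makes the selection one pass over the candidate proteins.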

B. FEATURE EXTRACTION STRATEGY
In this paper, we want to compress the 188D features proposed in [13]. We now briefly introduce how the 188D features are calculated [74].
As the amino acids possess a variety of properties, 188 features are extracted for the cytokine prediction, which are denoted as a 188D Feature Vector (FV).
The first 20 features (1-20) are denoted as $FV_1, \ldots, FV_{20}$:

$$FV_i = \frac{n_i}{L}, \quad i = 1, 2, \ldots, 20,$$

where $n_i$ is the number of occurrences of the $i$-th of the 20 AAs in the sequence and $L$ is the length of the sequence [75]. Eight kinds of properties are used to extract the remaining 168 features from a sequence: hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility. For each kind of physicochemical property, 21 features are extracted. Next, we take the hydrophobicity property as an example to show how to calculate the 21 feature values ($FV_{21}, \ldots, FV_{41}$) in the 188D FV.
According to the hydrophobicity property of 20 AAs, they can be classified into three groups, which are the RKEDQN, GASTPHY, and CVLIMFW groups.
$FV_{21}$, $FV_{22}$ and $FV_{23}$ are calculated as follows:

$$(FV_{21}, FV_{22}, FV_{23}) = \left(\frac{CH_1}{L}, \frac{CH_2}{L}, \frac{CH_3}{L}\right),$$

where $CH_1$, $CH_2$ and $CH_3$ are the sizes of the three groups. The FVs from 24 to 38 are calculated as follows:

$$(FV_{24}, \ldots, FV_{28};\ FV_{29}, \ldots, FV_{33};\ FV_{34}, \ldots, FV_{38}) = \left(\frac{DH_{11}}{L}, \ldots, \frac{DH_{15}}{L};\ \frac{DH_{21}}{L}, \ldots, \frac{DH_{25}}{L};\ \frac{DH_{31}}{L}, \ldots, \frac{DH_{35}}{L}\right),$$

where $DH_{ij}$ ($i = 1, 2, 3$; $j = 1, 2, \ldots, 5$) represents the position in the sequence at which the first, 25, 50, 75 and 100 percent of the AAs of group $i$ are located. $FV_{39}$, $FV_{40}$ and $FV_{41}$ are calculated as follows:

$$(FV_{39}, FV_{40}, FV_{41}) = \left(\frac{FH_1}{L-1}, \frac{FH_2}{L-1}, \frac{FH_3}{L-1}\right),$$

where $FH_i$ ($i = 1, 2, 3$) represents the respective number of bivalent seeds that contain two amino acids from different groups and $(L - 1)$ represents the number of bivalent seeds. A total of 21 features ($FV_{21}$ to $FV_{41}$) are calculated for the hydrophobicity property. After the other seven kinds of physicochemical properties are analyzed in the same way as the hydrophobicity, we get a 188D feature vector for a cytokine.
A 188D feature vector is calculated for each protein in the positive and negative data sets obtained by the CD-HIT program in step 3 of the data preprocessing procedure in Figure 1. This yields a data set that is suitable for training the machine learning algorithm.
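As an illustration of the hydrophobicity example, the following Python sketch computes 21 values for one sequence in the order used above: three group fractions, fifteen relative positions and three cross-group seed fractions. It is a sketch under assumptions, not the reference implementation: normalizing by the sequence length, the rounding rule used to locate the 25, 50 and 75 percent residues, and the function name are all assumed here.

```python
GROUPS = ("RKEDQN", "GASTPHY", "CVLIMFW")  # hydrophobicity groups from the text


def hydrophobicity_features(seq):
    """Return 21 values for a sequence with at least two residues."""
    length = len(seq)
    # Map each residue to its group index 0..2 (None for unknown symbols).
    labels = [next((g for g, aas in enumerate(GROUPS) if r in aas), None)
              for r in seq]

    # Composition: fraction of the residues that fall into each group.
    composition = [labels.count(g) / length for g in range(3)]

    # Distribution: relative position of the first, 25%, 50%, 75% and 100%
    # residue of each group.
    distribution = []
    for g in range(3):
        positions = [i + 1 for i, lab in enumerate(labels) if lab == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
                continue
            rank = max(1, int(len(positions) * frac))  # ASSUMED rounding rule
            distribution.append(positions[rank - 1] / length)

    # Transition: adjacent residue pairs (bivalent seeds) that join two
    # different groups, divided by the number of seeds, length - 1.
    seeds = list(zip(labels, labels[1:]))
    transition = []
    for a, b in ((0, 1), (0, 2), (1, 2)):
        count = sum(1 for p, q in seeds if {p, q} == {a, b})
        transition.append(count / (length - 1))

    return composition + distribution + transition


features = hydrophobicity_features("RGCRGC")
```

Applying the same function with the group definitions of the other seven properties, plus the 20 amino acid fractions, would give the full 188 values.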

C. SUPPORT VECTOR MACHINE
In this paper, we use the Support Vector Machine (SVM) [76]-[97] as the classification algorithm. Given a set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, n$ (called the training set), where $x_i$ is an $n$-dimensional vector and $y_i \in \{1, -1\}$ is its label [4], the SVM calculates the optimal solution of the following problem:

$$\min_{w, b, \xi} \ \frac{1}{2} w^T w + c \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0.$$

By mapping the $x_i$ in the training set to a much higher dimensional space, the SVM can find a hyperplane that separates the vectors in the training set with the maximal margin in the new space. Parameter $c$ is the penalty for the classification errors, and the kernel function is defined as $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. Four kinds of kernel functions are often used: the radial basis function (RBF) kernel, the sigmoid kernel, the linear kernel and the polynomial kernel.
The RBF kernel [98], [99] is used in this paper, so the classifier has two parameters, $c$ and $\gamma$. For a given problem, the optimal values of the two parameters are not known in advance. We use a grid-based searching strategy to find suitable values for $c$ and $\gamma$ so that the classifier accurately classifies the unknown data. Various pairs of $(c, \gamma)$ values are sampled from the grid searching space and the pair with the highest accuracy is selected.
To learn the optimal values of the parameters $(c, \gamma)$, the whole data set obtained in Section II-A is divided into two non-intersecting parts. The first part, called the "Optimal Parameter Searching Data Set (OPSDS)", is used to search for the optimal values of the two SVM parameters $(c, \gamma)$. A stratified selection method is used to draw 10% of the data from the whole data set, and the OPSDS is composed of the selected data. The stratified selection method ensures that the class distribution in the subset is the same as that of the whole data set. The second part, called the "Testing Data Set (TDS)", is composed of the remaining 90% of the data and is used to test the accuracy of the SVM.
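The stratified 10%/90% split and the grid search over $(c, \gamma)$ can be sketched in plain Python as follows. This is only an illustration: the `score` callable stands in for training and validating an RBF SVM on the OPSDS (for example with libSVM), and the power-of-two exponent ranges are assumed for the example rather than taken from the paper.

```python
import random
from collections import defaultdict


def stratified_split(labels, fraction=0.1, seed=0):
    """Split indices into (opsds, tds), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    opsds, tds = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * fraction)
        opsds.extend(indices[:cut])
        tds.extend(indices[cut:])
    return sorted(opsds), sorted(tds)


def grid_search(score, c_exponents=range(-5, 16, 2), g_exponents=range(-15, 4, 2)):
    """Return the (c, gamma) pair with the highest score on a log2 grid.

    `score(c, gamma)` is a hypothetical callable; the exponent ranges are
    ASSUMED for illustration.
    """
    grid = [(2.0 ** i, 2.0 ** j) for i in c_exponents for j in g_exponents]
    return max(grid, key=lambda pair: score(*pair))


labels = [1] * 9645 + [0] * 9299  # class counts reported in the text
opsds_idx, tds_idx = stratified_split(labels)

# Toy score with a known optimum at c = 8, gamma = 0.5 (not an SVM).
best_c, best_gamma = grid_search(lambda c, g: -abs(c - 8) - abs(g - 0.5))
```

Drawing the 10% separately inside each class is what keeps the positive-to-negative ratio of the OPSDS close to that of the whole data set.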

III. THE FEATURE COMPRESSION ALGORITHMS
In this section, three kinds of feature compression algorithms are introduced, which are the genetic based feature compression algorithm, the greedy based feature compression algorithm and the brute-force based feature compression algorithm. As is shown in Figure 2, all three algorithms need to learn the optimal values of (c, γ ) by using the OPSDS for any candidate compressed feature set. Then the TDS is used to evaluate the accuracy of the candidate compressed feature set by using the optimal (c, γ ) just learned. And the candidate compressed feature set with the highest accuracy will be the final compressed feature set selected by the algorithm.

A. THE GENETIC BASED FEATURE COMPRESSION ALGORITHM
In this section, the Genetic based Feature compression Algorithm is introduced. We present each component of the algorithm first, including how to represent the solution of the feature compression problem, how to construct the initial population and how to define the fitness function. Finally, we introduce the whole algorithm.

1) SOLUTION REPRESENTATION
In a genetic algorithm, a set of solutions to the optimization problem is constructed. By evolving the solutions generation by generation, a good solution can be found. The set of solutions is called the population P. We also need an encoding scheme to encode each solution in P, and a binary encoding scheme is used in this paper. A solution $x$ in P is thus a 0-1 vector with 188 dimensions, which is the number of features to be compressed. If an element $x_i$ of solution $x$ ($i = 1, 2, \ldots, 188$) is set to 1, then the $i$-th feature is included in the compressed feature set represented by $x$. Otherwise, $x$ does not include the $i$-th feature.

2) INITIAL POPULATION CONSTRUCTION ALGORITHM
The quality of the initial population strongly influences the later generations, so an initial population construction algorithm is proposed. The algorithm evaluates the worth of each feature by calculating the correlation between the feature and the class according to Pearson's formula (1). After calculating the worth of each feature, the algorithm constructs an individual of the initial population based on roulette selection. The probability calculated by formula (2) decides the chance that a feature is selected. The larger the worth of a feature, the larger the chance that the feature is selected in an individual of the initial population. The process is repeated M times which is the number
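The construction step can be sketched in Python as follows. Formulas (1) and (2) are not reproduced in this section, so the sketch makes two assumptions: the worth of a feature is the absolute value of the standard Pearson coefficient, and the inclusion probability is the worth divided by the largest worth. The second choice is a stand-in for the roulette step, not the paper's formula (2), and the function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Standard Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def initial_population(columns, labels, size, seed=0):
    """Build `size` binary solutions, favoring features correlated with the class."""
    rng = random.Random(seed)
    worth = [abs(pearson(col, labels)) for col in columns]
    top = max(worth) or 1.0
    # ASSUMPTION: inclusion probability proportional to worth, scaled by the max.
    probs = [w / top for w in worth]
    return [[1 if rng.random() < p else 0 for p in probs] for _ in range(size)]


# Three toy feature columns: perfectly correlated, uncorrelated, constant.
columns = [[0, 1, 0, 1], [1, 1, 0, 0], [3, 3, 3, 3]]
labels = [0, 1, 0, 1]
population = initial_population(columns, labels, size=5)
```

In this toy case only the first column carries information about the class, so it is the only feature that ever enters a solution.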

3) FITNESS FUNCTION
Given a solution $x = (x_1, x_2, \ldots, x_{188})$ in the population P, $x$ is used by the SVM to classify the data in the OPSDS. The classification accuracy, which is defined by formula (3), is used as the fitness of the solution $x$:

$$acc = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (3)$$

where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives classified by the SVM according to $x$.
The basic idea of the Fitness Calculation Algorithm is to evaluate the value of each solution x in population P according to the fitness function formula (3). To calculate the fitness value of a solution x, we first filter the data in the OPSDS and TDS according to formula (4) and get the filtered data set OPSDS1 and TDS1. Then the OPSDS1 is used to search the optimal value of (c, γ ) for the solution x. Finally, we classify the data in TDS1 by using the (c, γ ) just learned and the classification accuracy is the fitness value of solution x. The Fitness Calculation Algorithm is shown in Algorithm 2.
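The two building blocks of the fitness calculation, filtering a data set by a solution and computing the accuracy from the four counts, can be sketched as below. Formula (4) is not reproduced in this section, so reading the filter as "keep the columns whose bit is 1" is an assumption, and the function names are hypothetical; the SVM training and the $(c, \gamma)$ search are not shown.

```python
def filter_features(rows, solution):
    """Keep only the columns whose bit in `solution` is 1.

    ASSUMED reading of the filtering step (formula (4) is not shown here).
    """
    keep = [i for i, bit in enumerate(solution) if bit == 1]
    return [[row[i] for i in keep] for row in rows]


def accuracy(y_true, y_pred):
    """Classification accuracy (TP + TN) / (TP + TN + FP + FN) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return (tp + tn) / (tp + tn + fp + fn)


filtered = filter_features([[1, 2, 3], [4, 5, 6]], [1, 0, 1])
acc = accuracy([1, 1, 0, 0], [1, 0, 0, 1])
```

In the full procedure both OPSDS and TDS would be filtered with the same solution before the classifier is tuned and tested.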

Algorithm 2 Fitness Calculation Algorithm
Input: solution x, OPSDS, TDS
Output: fitness of solution x
1: For ∀d ∈ OPSDS, calculate d according to formula (4) and get the result data set OPSDS1
2: Search the optimal value for (c, γ) based on the grid search algorithm provided by libSVM by using the OPSDS1
3: Filter the data in TDS in the same way as in Step 1 and get the result data set TDS1
4: Classify the data in TDS1 based on the SVM by using the (c, γ) found in Step 2
5: Calculate the classification accuracy acc of the classification as the fitness of the solution x
6: return acc

The Genetic based Feature compression Algorithm starts with the initial population of solutions generated by the initial population construction algorithm (Algorithm 1). Each solution represents a candidate feature subset for the feature compression problem, and the population evolves for several generations. During each generation, the fitness function (Algorithm 2) is applied to each solution in the population to determine its quality. In each generation, the population is updated through crossover and mutation operators. Good solutions are selected according to the tournament selection method. The Genetic based Feature compression Algorithm invokes the standard one-point crossover and bit mutation to update the current population. The search is terminated when the number of generations exceeds a threshold. The Genetic based Feature compression Algorithm is stated in Algorithm 3.
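The evolutionary loop can be sketched in Python as follows, with tournament selection, one-point crossover and bit mutation. This is a minimal sketch under assumptions: the fitness function is a toy stand-in for Algorithm 2 (no SVM is trained), the tournament size of 2 is assumed, and the sketch remembers the best solution seen so far, which is a convenience added here rather than a step stated in the text.

```python
import random


def tournament(population, scores, rng, k=2):
    """Return the fittest of k randomly drawn solutions (k = 2 is ASSUMED)."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]


def one_point_crossover(a, b, rng):
    """Swap the tails of two solutions at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(solution, p_m, rng):
    """Flip each bit independently with probability p_m."""
    return [1 - bit if rng.random() < p_m else bit for bit in solution]


def genetic_search(fitness, population, g_max=20, p_c=0.7, p_m=0.05, seed=0):
    rng = random.Random(seed)
    best = max(population, key=fitness)
    for _ in range(g_max):
        scores = [fitness(s) for s in population]
        parents = [tournament(population, scores, rng) for _ in population]
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            if rng.random() < p_c:
                a, b = one_point_crossover(a, b, rng)
            children += [mutate(a, p_m, rng), mutate(b, p_m, rng)]
        if len(parents) % 2:  # odd population size: carry the last parent over
            children.append(mutate(parents[-1], p_m, rng))
        population = children
        best = max(population + [best], key=fitness)  # track best seen so far
    return best


TARGET = [1, 0, 1, 1, 0, 0, 1, 0]  # toy optimum, for illustration only


def toy_fitness(solution):
    """Number of bits that agree with TARGET; stands in for SVM accuracy."""
    return sum(1 for s, t in zip(solution, TARGET) if s == t)


init_rng = random.Random(1)
start = [[init_rng.randint(0, 1) for _ in TARGET] for _ in range(25)]
best = genetic_search(toy_fitness, start)
```

With the real fitness function, every call to `fitness` would involve a parameter search and an SVM run, which is why the number of fitness evaluations dominates the running time.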

Algorithm 3 Genetic Based Feature Compression Algorithm
Input: OPSDS, TDS
Output: The compressed feature set
1: Initialization. Set the population size $M$ and the maximum number of generations $g_{max}$. Set the crossover probability $p_c \in (0, 1)$ and the mutation probability $p_m \in (0, 1)$. Generate an initial population $P_0$ by using Algorithm 1.
2: Parent Selection. Select a temporary population $P_t$ from the current population by using the tournament selection method.
3: Crossover. Apply the one-point crossover operation to the solutions in $P_t$, and update $P_t$.
4: Mutation. Apply the uniform mutation operation to the solutions in $P_t$, and update $P_t$.
5: Survival Selection. Calculate the fitness value of all solutions in the updated $P_t$ by calling Algorithm 2 and set $P_{t+1} = P_t$.
6: Stopping Condition. If $t > g_{max}$, then terminate. Otherwise, set $t = t + 1$ and go to Step 2.
7: return the solution in the current population that has the maximum fitness value as the best compressed feature set

B. THE GREEDY BASED FEATURE COMPRESSION ALGORITHM
In Section II-B, we introduced that the 188D features can be classified into 9 classes according to their physicochemical properties: 20 features belong to the quantities of the AAs, and 21 features belong to each of the remaining eight kinds of physicochemical properties.
In the greedy based feature compression algorithm, we consider all the features belonging to the same class as a feature class, so there are altogether 9 feature classes. Once a feature class is selected by the greedy based algorithm, all the features belonging to the class will be included in the final compressed feature set.
The basic idea of the greedy based algorithm is to evaluate the relationship between each feature class and the prediction results of the cytokine data set. The greedy based algorithm then adds the feature classes to the compressed feature set one by one, in order of their influence on the prediction results, until all 188D features are added.
Three different kinds of methods are used to evaluate the relationship between an individual feature and the prediction result, which are the correlation based method, the Info Gain based method and the Gain Ratio based method. By adding all the evaluation results of the features belonging to the same feature class, we can evaluate the relationship between a feature class with the prediction results of the cytokine data set. The correlation based evaluation method is given by formula (1). The Info Gain based evaluation method is given by formula (5). The Gain Ratio based evaluation method is given by formula (6). The greedy based feature compression algorithm is given by Algorithm 4.
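The scoring and the greedy addition of feature classes can be sketched together in Python. Formulas (5) and (6) are not reproduced in this section, so the sketch uses the textbook definitions of information gain and gain ratio for a discrete feature; treating the features as already discretized is an assumption, and `evaluate` is a hypothetical callable that would tune and test the SVM on the selected columns.

```python
import math
from collections import Counter

CLASS_SIZES = [20] + [21] * 8  # AA quantities plus eight physicochemical properties


def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def info_gain(feature, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    n = len(labels)
    conditional = 0.0
    for value, count in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (count / n) * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain divided by the entropy of the feature itself."""
    split = entropy(feature)
    return info_gain(feature, labels) / split if split else 0.0


def class_slices(sizes=CLASS_SIZES):
    """Return the (start, stop) column range of each feature class."""
    bounds, start = [], 0
    for size in sizes:
        bounds.append((start, start + size))
        start += size
    return bounds


def greedy_compress(feature_scores, evaluate):
    """Add feature classes from the highest to the lowest summed score."""
    slices = class_slices()
    class_scores = [sum(feature_scores[a:b]) for a, b in slices]
    order = sorted(range(len(slices)), key=lambda k: class_scores[k], reverse=True)
    selected, accuracies = [], []
    for k in order:
        a, b = slices[k]
        selected.extend(range(a, b))
        accuracies.append(evaluate(sorted(selected)))  # one evaluation per class
    return order, accuracies


# Toy run: made-up scores, and `len` in place of the SVM-based evaluation.
toy_scores = [0.5] * 20 + [1.0] * 21 + [0.0] * 147
order, sizes_seen = greedy_compress(toy_scores, len)
```

The loop calls `evaluate` exactly nine times, once after each feature class is added.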
The Info Gain is calculated by the following formula: where

Algorithm 4 The Greedy Based Feature Compression Algorithm
Input: OPSDS, TDS
Output: The accuracies calculated for every compressed feature set
1: Evaluate each feature of the 188D feature set based on formula (1), (5) or (6) and get the evaluation result set $R = \{r_1, r_2, \ldots, r_{188}\}$
2: Calculate the evaluation result of each feature class by adding together the evaluation results in $R$ of the features that belong to that class
3: Sort the class evaluation results from largest to smallest
4: while i < 9 do
5: Add the features belonging to the $i$-th ranked feature class to the compressed feature set
6: Filter the data in OPSDS and TDS and get the data sets OPSDS1 and TDS1
7: Search the optimal value for (c, γ) by using OPSDS1
8: Classify the data in TDS1 based on the SVM by using the (c, γ)
9: Calculate the classification accuracy acc of the classification and store acc into an array ACC
10: i = i + 1
11: end while
12: return ACC

The Gain Ratio is calculated by the following formula: where

C. THE BRUTE-FORCE BASED FEATURE COMPRESSION ALGORITHM
In this algorithm, one bit is used to represent a feature class. As there are 9 feature classes, 9 bits are needed altogether. If a feature class is selected, the bit representing it is set to 1; otherwise, the bit is set to 0. Excluding the number zero, there are altogether 511 feature selection strategies. The brute-force feature compression algorithm enumerates the feature selection strategies and selects the one with the highest classification accuracy. According to our experimental results, the strategies with fewer than 6 feature classes get poor classification accuracy, so the brute-force based feature compression algorithm only enumerates the feature selection strategies that have more than 6 feature classes. The brute-force based feature compression algorithm is given by Algorithm 5.
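The enumeration can be sketched in Python as follows. The sketch reads the 9-bit vector from the most significant bit, so that bit 1 is the first feature class, which matches the way the number 383 is interpreted in the experiments; `evaluate` is a hypothetical callable that would return the SVM accuracy for a list of selected classes, and the toy score below only serves to exercise the code.

```python
def classes_from_number(number, n_classes=9):
    """Decode a strategy number into 1-based class positions, leftmost bit first."""
    bits = format(number, f"0{n_classes}b")
    return [pos + 1 for pos, bit in enumerate(bits) if bit == "1"]


def candidate_strategies(n_classes=9, min_classes=7):
    """Strategy numbers that select more than six feature classes."""
    return [m for m in range(1, 2 ** n_classes)
            if bin(m).count("1") >= min_classes]


def brute_force(evaluate):
    """Return the candidate strategy number with the highest score."""
    return max(candidate_strategies(),
               key=lambda m: evaluate(classes_from_number(m)))


def toy_score(selected):
    """Made-up score that rewards many classes but penalizes class 2."""
    return len(selected) - 5 * (2 in selected)


n_candidates = len(candidate_strategies())
best_number = brute_force(toy_score)
best_classes = classes_from_number(best_number)
```

Restricting the enumeration to strategies with more than six classes leaves 46 of the 511 non-empty combinations.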

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, three experiments are done to test the performance of the three feature compression algorithms proposed in this paper. Three evaluation standards are used to compare the algorithms: the number of features contained in the final compressed feature set, the classification accuracy and the running time of the algorithms. Finally, we analyze the advantages and disadvantages of the different algorithms.

Algorithm 5 The Brute-Force Based Feature Compression Algorithm
...
3: if the binary value of i has more than six 1-bits then
4: Add the features belonging to the feature classes corresponding to the 1-bits to the compressed feature set
5: end if
6: Filter the data in OPSDS and TDS and get the data sets OPSDS1 and TDS1 according to the compressed feature set
7: Classify the data in TDS1 based on the SVM by using the (c, γ)
8: Calculate the classification accuracy acc of the classification and store acc into an array ACC
9: i = i + 1
10: end while
11: return ACC

A. PERFORMANCE OF THE GENETIC BASED FEATURE COMPRESSION ALGORITHM
In this experiment, we test the performance of the genetic based feature compression algorithm. Firstly, we produce a population with 25 solutions by Algorithm 1. The crossover probability $p_c$ and the mutation probability $p_m$ are set to (0.5, 0.05), (0.7, 0.05), (0.9, 0.05), (0.5, 0.1), (0.7, 0.1), (0.9, 0.1), (0.5, 0.2), (0.7, 0.2) and (0.9, 0.2) respectively. We run the genetic based feature compression algorithm (Algorithm 3) for 20 generations. The maximum fitness value in each generation, calculated by Algorithm 2 for each case of $p_c$ and $p_m$, is shown in Figure 3, which shows that the fitness values generally increase with the number of generations. After 20 generations, the fitness value for each pair of $(p_c, p_m)$ is steady and it is the maximum fitness value among all generations. We use the solution with the maximum fitness value as the selected compressed feature set to test the performance of the genetic based feature compression algorithm.
Then we produce a population with 50 solutions by Algorithm 1. The crossover probabilities and the mutation probabilities are set to the same values as in Figure 3. We run Algorithm 3 for 10 generations. The fitness values, calculated by Algorithm 2 for each generation, are shown in Figure 4. It also shows that, after 10 generations, we get the solution with the maximum fitness value among all generations for each pair of $(p_c, p_m)$, which can be used as the selected compressed feature set to test the performance of the genetic based feature compression algorithm. After running Algorithm 3 for 20 generations with 25 solutions in the initial population (n = 25), we get 9 solutions, which are 9 sets of compressed features for the 188D features. In the same way, we get another 9 sets of compressed features with 50 solutions in the initial population (n = 50). In Figure 5, we compare the number of features contained in the final compressed feature set for n = 25 and n = 50. On average, the number of compressed features for n = 25 is 101, which is less than the 104 obtained for n = 50.
In Figure 6, we compare the classification accuracy of the 9 groups of compressed feature sets obtained for n = 25 and n = 50. It shows that the maximum accuracy is achieved for n = 25, $p_c$ = 0.5 and $p_m$ = 0.05, in which case the number of compressed features is 111.

B. PERFORMANCE OF THE GREEDY BASED FEATURE COMPRESSION ALGORITHM
In this experiment, we test the performance of the greedy based feature compression algorithm. The correlation based, Info Gain based and Gain Ratio based methods are used to evaluate the rank of each feature class. The experimental results are shown in Figure 7. The x axis is the number of feature classes used to classify the cytokine data. The y axis is the classification accuracy based on the features in the selected feature classes. Figure 7 shows that when few feature classes are selected (less than 4), the accuracy of the SVM classifier is poor. As the number of selected feature classes increases, the accuracy improves. Among the three evaluation methods, the accuracy of the correlation based method is the most steady. The best accuracy is achieved by the Info Gain based method when 8 feature classes, with 167 features, are selected. The classification accuracy of the selected features is also better than that of the 188D features.

C. PERFORMANCE OF THE BRUTE-FORCE BASED FEATURE COMPRESSION ALGORITHM
In this experiment, we test the performance of the brute-force based feature compression algorithm. The x axis is the number that encodes a feature selection strategy, and the y axis is the classification accuracy corresponding to that strategy. The results show that when the number is 383, the brute-force based algorithm achieves the maximum accuracy, which is better than that of the greedy based algorithm and the genetic based algorithm. The decimal number 383 corresponds to the binary number 101111111, which means that only the second feature class is not selected as a classification feature for the SVM. The feature compression performance of the brute-force based algorithm is the same as that of the greedy based algorithm.

D. DISCUSSION
From the three groups of experiments, we can compare the three feature compression algorithms proposed in this paper from three aspects, which are the accuracy, feature compression and runtime.
Among the three algorithms, the genetic based algorithm obtains the smallest compressed feature set at the same classification accuracy. However, as the search space of the genetic based algorithm is very large, it is hard to find the globally optimal accuracy, so in our experiment the best accuracy of the genetic based algorithm is not better than that of the other two algorithms. As the fitness value is measured by the classification accuracy of the SVM, the genetic based algorithm needs to train the SVM for every solution in the population. This is why the time spent by the genetic based algorithm is the longest among the three algorithms.
The brute-force based algorithm achieves the best classification accuracy among the three algorithms, and its number of compressed features is the same as that of the greedy based algorithm. However, it consumes much more time than the greedy based algorithm because it needs to train the SVM for hundreds of feature selection strategies.
The greedy based algorithm makes a good trade-off among the accuracy, the number of compressed features and the runtime. It achieves better accuracy than the original 188D features with fewer features. The runtime of the greedy based algorithm is much less than that of the other two algorithms because it only needs to train the SVM 9 times.
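The runtime ordering can be made concrete by counting how many candidate feature sets each algorithm has to evaluate, since every evaluation involves a $(c, \gamma)$ search and an SVM run. The short Python sketch below does this arithmetic; it assumes that the genetic based algorithm evaluates every solution once per generation, and it reports counts of evaluations, not measured running times.

```python
from math import comb

N_CLASSES = 9

# Greedy: one evaluation after each feature class is added.
greedy_evaluations = N_CLASSES

# Brute force: every non-empty combination of the nine feature classes ...
brute_force_all = 2 ** N_CLASSES - 1
# ... or only the combinations with more than six classes.
brute_force_pruned = sum(comb(N_CLASSES, k) for k in range(7, N_CLASSES + 1))

# Genetic: population size times number of generations (ASSUMED: one fitness
# evaluation per solution and generation) for the two experimental settings.
genetic_evaluations = {(m, g): m * g for m, g in [(25, 20), (50, 10)]}
```

Under these assumptions the greedy based algorithm needs 9 evaluations, the brute-force based algorithm between 46 and 511, and the genetic based algorithm 500 in both experimental settings.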

V. CONCLUSION
In this paper, three feature compression algorithms are proposed to compress a 188D feature set: the genetic based, the greedy based and the brute-force based feature compression algorithms. The experimental results show that the brute-force based algorithm achieves the highest classification accuracy, and that the genetic based algorithm selects the fewest features from the 188D features as the compressed feature set. The shortcoming of these two algorithms is that they are time-consuming, because they repeatedly run the SVM classifier during feature selection. The greedy based algorithm makes a good trade-off among the classification accuracy, the number of compressed features and the time consumption.