Identification of Secreted Proteins From Malaria Protozoa With Few Features



I. INTRODUCTION
Malaria is a mosquito-borne infectious disease caused by malaria protozoa. According to the World Health Organization, approximately 1 million people are infected with malaria. As has been done for other diseases [1], [2], correctly identifying the proteins responsible for malaria could provide important clues for understanding pathogenesis and discovering drug targets [3], [4] for molecular therapy [5]-[7]. Malaria protozoa secrete a series of proteins into host red blood cells, and these proteins can serve as vaccine targets or potential drug targets. It is therefore quite important to distinguish the secreted proteins of malaria protozoa. Biochemical experiments can identify all malaria proteins; however, they are time-consuming and laborious.
In the past decade, a number of bioinformatics methods have been introduced to predict the secreted proteins of malaria parasites and other types of disease-causing molecules [8]. Verma et al. [9] prepared a benchmark data set incorporating 252 non-secretory proteins and 252 secretory proteins and devised a method based on the support vector machine (SVM) to predict the secretory proteins of malaria parasites. Machine learning has been widely employed in protein identification [10], [11] and biomedical diagnosis [12]-[14]. The core idea of ''machine learning'' is to convert things that are difficult to compare (e.g., strings, voice signals, and pictures) into vectors (a vector is a row of a matrix, i.e., several numbers). The process of converting a string (or picture or voice signal) into a vector is called ''feature extraction'', and this vector is called a feature. Zuo and Li used the K-minimum increment of diversity (K-MID) method to predict secreted proteins using 169-dimensional features [15]. Lin et al. used the Grey system model to add sequence evolution information to the pseudo-amino acid composition and used 60-dimensional features to predict secretory proteins [16]. Fan et al. developed a prediction method called Discriminating Secretory Proteins of Malaria Parasite (DSPMP), which uses 300-dimensional features [17]. Tang et al. incorporated the chemical and physical properties of amino acid residues into a prediction model and used 53-dimensional features for prediction [18]. Guo et al. proposed various data compression models based on sequence evolution information and physicochemical properties to predict specific functional proteins [19]-[22].
The associate editor coordinating the review of this manuscript and approving it for publication was Quan Zou.
The models in the literature rely on high-dimensional features. The purpose of this study is to design a low-dimensional model to predict the secreted proteins of malaria parasites.

II. MATERIALS AND METHODS
The research process is represented by the flow chart shown in Fig. 1. The process includes constructing a benchmark dataset, extracting sample features, optimizing the features, selecting a machine learning method, proposing a model evaluation strategy and evaluating the results [23], [24]. The following sections introduce each step in detail.

A. BENCHMARK DATASET
The benchmark dataset used in this article was constructed by Verma et al. [9]. It incorporates 504 proteins divided into two categories: the positive set contains 252 secreted proteins of malaria protozoa, and the negative set contains 252 non-secretory proteins of malaria protozoa. Zuo, Li, Lin, Fan, Hua and others have also used this benchmark dataset.

B. FEATURE EXTRACTION
Machine learning algorithms cannot directly operate on a continuous amino acid sequence, so a necessary step is to convert the amino acid sequence, represented as a string, into a numerical feature vector [25]-[29]. Feature extraction methods play an important role in constructing computational predictors.
In this paper, we used iFeature [30], a versatile web server and Python-based package, to generate diverse numerical feature representation schemes for protein and peptide sequences. iFeature can calculate and extract the complete spectrum of eighteen main sequence-encoding schemes covering fifty-three different types of feature descriptors. In addition, iFeature enables users to extract specific amino acid attributes from the AAindex database [31]. iFeature also integrates twelve commonly used feature selection, dimension reduction and clustering algorithms, which facilitates analysis, benchmarking and the training of machine learning models. This paper used the amino acid composition, conjoint triad, pseudo-amino acid composition, C/T/D and grouped amino acid composition strategies for feature extraction.

1) AMINO ACID COMPOSITION (AAC)
A protein sequence consists of twenty types of amino acids [32]. The frequency of each amino acid in a protein, which has been widely used in protein classification, is calculated to describe a protein sample. Thus, each protein sample can be formulated as a 20-dimensional feature vector.
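As a concrete illustration, the AAC descriptor can be computed in a few lines of Python (a minimal sketch of the standard definition, not the iFeature implementation):

```python
# Sketch of amino acid composition (AAC) feature extraction:
# each protein sequence maps to a 20-dimensional frequency vector.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac_features(sequence: str) -> list[float]:
    """Return the frequency of each of the 20 amino acids in `sequence`."""
    seq = sequence.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

# Toy peptide (not a real malaria protein), just to show the output shape.
vec = aac_features("ACDEFAC")
print(len(vec))  # 20
```

The components always sum to 1, since they are frequencies over the sequence length.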

2) PSEUDO-AMINO ACID COMPOSITION
The pseudo-amino acid composition (Pse-aac) model was proposed by Chou et al. [33]. In this model, the features of the amino acid composition are retained, and sequence-order information is added by extending the feature vector. Thus, the pseudo-amino acid composition feature vector is expressed as follows:

Pse-aac = (x1, x2, x3, ..., x20, x20+1, ..., x20+n)

The first 20 components x1, ..., x20 represent the frequency of occurrence of each amino acid, and the latter n components x20+1, ..., x20+n represent the sequence-order (position) information of the residues in the amino acid sequence. For the specific implementation, see [33]. This feature has been widely used in protein subcellular localization prediction, protein structural class prediction and other protein prediction fields.
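A simplified sketch of the idea follows (type-1 Pse-aac computed from a single property; the Kyte-Doolittle hydrophobicity values used here are illustrative, whereas [33] averages several normalized property scales):

```python
# Simplified sketch of type-1 pseudo-amino acid composition: the 20 amino
# acid frequencies plus lam sequence-order correlation factors, weighted
# by w. Property values here are Kyte-Doolittle hydrophobicity, used only
# for illustration.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDRO = dict(zip(AMINO_ACIDS,
    [1.8, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5, -3.9, 3.8,
     1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3]))

def pseaac(seq: str, lam: int = 3, w: float = 0.05) -> list[float]:
    """Return a (20 + lam)-dimensional pseudo-amino acid composition."""
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    # Sequence-order correlation factors theta_1 .. theta_lam: mean squared
    # property difference between residues j positions apart.
    thetas = []
    for j in range(1, lam + 1):
        diffs = [(HYDRO[seq[i]] - HYDRO[seq[i + j]]) ** 2
                 for i in range(len(seq) - j)]
        thetas.append(sum(diffs) / len(diffs))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

vec = pseaac("ACDEFGHIKLMNP", lam=3)
print(len(vec))  # 23
```

With lam = 3 the vector has 20 + 3 = 23 dimensions, and all components are normalized to sum to 1.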

3) GROUPED AMINO ACID COMPOSITION
The grouped amino acid composition (Gaac) model classifies the twenty amino acid types into groups according to their chemical and physical properties and then calculates the composition of each group and the interactions between groups. This feature reflects conserved motifs in proteins and has therefore also been used in protein identification.

4) C/T/D
The C/T/D model was proposed by Dubchak et al. [34]. The model takes into account three properties of amino acids, namely, relative hydrophobicity, secondary structure and solubility. Amino acids are divided into two groups based on solubility, three groups based on relative hydrophobicity, and three or four groups based on secondary structure. Each property is described by three descriptors: composition (C), transition (T) and distribution (D) [35]. For the specific implementation, see [34].
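The composition and transition descriptors can be sketched as follows for a single property; the three-group hydrophobicity partition (polar / neutral / hydrophobic) shown here is a commonly used convention, assumed for illustration:

```python
# Sketch of the C and T descriptors of the C/T/D model for one property.
# C is the fraction of residues in each group; T is the frequency of
# transitions between two different groups along the sequence. (D, not
# shown, records where each group reaches 0/25/50/75/100% of its
# occurrences.)

GROUP_OF = {}
for aa in "RKEDQN":   GROUP_OF[aa] = 0  # polar
for aa in "GASTPHY":  GROUP_OF[aa] = 1  # neutral
for aa in "CLVIMFW":  GROUP_OF[aa] = 2  # hydrophobic

def ctd_composition(seq: str) -> list[float]:
    labels = [GROUP_OF[aa] for aa in seq]
    return [labels.count(g) / len(labels) for g in range(3)]

def ctd_transition(seq: str) -> list[float]:
    labels = [GROUP_OF[aa] for aa in seq]
    pairs = list(zip(labels, labels[1:]))
    return [sum(1 for x, y in pairs if {x, y} == {a, b}) / len(pairs)
            for a, b in [(0, 1), (0, 2), (1, 2)]]

c = ctd_composition("RGCRGC")
t = ctd_transition("RGCRGC")
```

For the toy sequence RGCRGC the three groups each hold a third of the residues, and transitions between adjacent groups dominate.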

5) CONJOINT TRIAD
The conjoint triad (CT) model was proposed by Shen et al. [36]. The model divides amino acids into seven classes according to the properties of each amino acid and its neighbours, and takes every three consecutive amino acids as a unit. Triads are thus distinguished by the classes of their amino acids: for example, triads composed of amino acids belonging to the same classes, such as VKS and ART, are regarded as identical because they may play similar roles. A binary space (V, F) is used to represent a protein sequence, where V is the vector space of sequence features, with each feature vi representing one triad of class values, and F is the frequency vector corresponding to V, with the i-th component fi being the frequency of triad type vi in the protein sequence.
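A minimal sketch of the CT encoding, using the seven-class partition of Shen et al. [36]; note that VKS and ART (the example above) land in the same bin:

```python
# Sketch of conjoint triad (CT) features: residues are mapped to seven
# classes, and every window of three consecutive residues is counted as a
# class triad, giving a 7*7*7 = 343-dimensional frequency vector.

CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: i for i, group in enumerate(CLASSES) for aa in group}

def conjoint_triad(seq: str) -> list[float]:
    counts = [0] * (7 ** 3)
    for i in range(len(seq) - 2):
        a, b, c = (CLASS_OF[x] for x in seq[i:i + 3])
        counts[a * 49 + b * 7 + c] += 1
    total = max(sum(counts), 1)
    return [n / total for n in counts]

vec = conjoint_triad("VKSART")
# VKS and ART both map to the class triad (0, 4, 2), i.e. index 30,
# so that bin holds 2 of the 4 triads in this toy sequence.
```

This is why triads such as VKS and ART are "regarded identically": only the class labels of the three residues matter, not the residues themselves.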

C. FEATURE SELECTION
FIGURE 1. The main flow chart of the research process in this paper. VOLUME 8, 2020

Feature selection is the process of selecting a subset of relevant features (predictors, variables) for use in model construction. The feature dimension is reduced after selection, so this process is also called dimension reduction [37]-[44]. In this paper, MRMD1.0 and MRMD2.0 are used to reduce the feature dimension. The MRMD1.0 feature selection method is determined by two parts: the first is the correlation between features and instance class labels; the other is the redundancy between features. MRMD1.0 uses the Pearson correlation coefficient to calculate the correlation between features and class labels, and three distance functions (Tanimoto coefficient, Euclidean distance and cosine distance) to calculate the redundancy between features. The larger the Pearson correlation coefficient, the closer the relationship between class labels and features; the greater the distance, the lower the redundancy between features. Finally, MRMD1.0 selects a subset of features that are strongly related to the class label and have low redundancy among themselves. PageRank (PR) is a mathematical algorithm used to evaluate the quality and quantity of links to a web page. MRMD2.0 uses the idea of PR and combines several other feature selection methods, including ANOVA, MRMD, MRMR, LASSO, and MIC.
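The relevance-plus-distance idea behind MRMD1.0 can be sketched as follows; the scoring and equal weighting used here are illustrative, not the exact MRMD formulation:

```python
# Sketch of the MRMD1.0 idea: rank features by (relevance to the class
# label, via |Pearson correlation|) plus (non-redundancy, via mean
# Euclidean distance to the other feature columns).
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def mrmd_rank(features, labels):
    """features: list of columns (one list per feature).
    Returns feature indices sorted by relevance + non-redundancy."""
    scores = []
    for i, f in enumerate(features):
        rel = abs(pearson(f, labels))
        dists = [math.dist(f, g) for j, g in enumerate(features) if j != i]
        scores.append((rel + sum(dists) / len(dists), i))
    return [i for _, i in sorted(scores, reverse=True)]

# Toy example: feature 0 tracks the label; feature 1 is constant noise.
f0 = [0.0, 0.1, 0.9, 1.0]
f1 = [0.5, 0.5, 0.5, 0.5]
order = mrmd_rank([f0, f1], [0, 0, 1, 1])
print(order[0])  # feature 0 ranks first
```

A feature that is both strongly correlated with the class label and far (in feature space) from the other features gets the highest score, which is the selection criterion the paragraph describes.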

D. CLASSIFIER SELECTION 1) WEKA AND RANDOM FOREST
The Waikato Environment for Knowledge Analysis (Weka) is a well-known machine learning software package for predictive modelling and data analysis. In this study, Weka is used as the platform. The ''classification'' option of Weka offers a variety of classifiers, such as RandomForest, ZeroR, KStar and LibSVM. Random forest has been widely utilized in bioinformatics [45]-[52]. In this paper, random forest (RF) was adopted as the classifier, and 10-fold cross-validation was used to evaluate its performance.

2) SUPPORT VECTOR MACHINE
In this study, the support vector machine algorithm is also used for prediction. Data examples labelled positive or negative are projected into a high-dimensional feature space, and the separating hyperplane in that space is optimized by maximizing the margin between the positive and negative data. SVM has been extensively utilized in genome, transcriptome and proteome prediction [53]-[64] and is thus a quite effective machine learning method. In this paper, LibSVM is used, and a grid search is applied to tune the parameters c and g to optimize the prediction results.
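The grid search over c and g follows the usual LibSVM practice of trying powers of two. A sketch of the loop is below; `cv_accuracy` is a stand-in for running 10-fold cross-validation of an RBF-kernel SVM at the given parameter pair, and the toy objective is only there to exercise the loop:

```python
# Sketch of a grid search over the SVM parameters c and g: try powers of
# two over the customary LibSVM ranges and keep the pair with the best
# cross-validated accuracy.
import math

def grid_search(cv_accuracy, c_exps=range(-5, 16, 2), g_exps=range(-15, 4, 2)):
    best = (-1.0, None, None)  # (accuracy, c, g)
    for ce in c_exps:
        for ge in g_exps:
            c, g = 2.0 ** ce, 2.0 ** ge
            acc = cv_accuracy(c, g)
            if acc > best[0]:
                best = (acc, c, g)
    return best

# Toy objective peaking at c = 2^5 = 32, g = 2^-15 (a hypothetical
# stand-in for real cross-validation).
def toy(c, g):
    return 1.0 - abs(math.log2(c) - 5) / 100 - abs(math.log2(g) + 15) / 100

best_acc, best_c, best_g = grid_search(toy)
print(best_c, best_g)  # 32.0 3.0517578125e-05
```

In a real run, `cv_accuracy` would train and validate LibSVM at each grid point; the loop structure and the power-of-two grid are the essential idea.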

E. PREDICTION ACCURACY EVALUATION
It is very important to quantitatively estimate the performance of the proposed method. In this paper, the following commonly used indices are selected to estimate the performance of the model [65]-[78]:

Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
Acc = (TP + TN) / (TP + TN + FP + FN)
Mcc = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP is the quantity of correctly predicted secretory proteins, FP is the quantity of non-secretory proteins predicted as secretory proteins, TN is the quantity of correctly predicted non-secretory proteins, and FN is the quantity of secretory proteins predicted as non-secretory proteins.
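Written out from the TP/FP/TN/FN counts, the four standard indices are straightforward to compute; the counts below are made-up numbers, used only to exercise the definitions:

```python
# The four common evaluation indices computed from a confusion matrix.
import math

def metrics(tp, fp, tn, fn):
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # overall accuracy
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sn, sp, acc, mcc

# Hypothetical confusion matrix, for illustration only.
sn, sp, acc, mcc = metrics(tp=40, fp=10, tn=45, fn=5)
print(round(acc, 2))  # 0.85
```

Mcc is the most informative of the four on balanced binary data, since it uses all four cells of the confusion matrix.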

III. RESULTS AND DISCUSSION
Five feature extraction methods from iFeature were used: amino acid composition, grouped amino acid composition, pseudo-amino acid composition, conjoint triad and C/T/D. Table 1 lists the exact dimensions obtained by each method. Each feature was tested through 10-fold cross-validation using the random forest classifier. Fig. 2 shows the prediction accuracy of each method. The accuracy of the amino acid composition was low, and although the accuracies of the other four methods were relatively high, the dimensions of each feature were too high. Thus, this paper reduced the dimensions of each feature separately.
MRMD1.0 was used to compare each kind of feature after feature selection with its state before dimension reduction; the results are shown in Fig. 3(a) and Fig. 3(b).
After each kind of feature was selected by MRMD2.0, it was compared with its state before dimension reduction; the results are shown in Fig. 3(c) and Fig. 3(d).
By comparison, no clear differences were observed between the two feature selection tools, so the follow-up study used both. Because the dimension of the conjoint triad remained high after feature selection by either tool, and the feature selection of the amino acid composition using MRMD1.0 was not ideal, the subsequent study with MRMD2.0 only considered the amino acid composition, C/T/D, grouped amino acid composition and pseudo-amino acid composition features, while MRMD1.0 only considered the grouped amino acid composition, pseudo-amino acid composition and C/T/D features.

FIGURE 3. (a) The dimension before and after dimension reduction using MRMD1.0. (b) The prediction accuracy before and after dimension reduction using MRMD1.0. (c) The dimension before and after dimension reduction using MRMD2.0. (d) The prediction accuracy before and after dimension reduction using MRMD2.0.
After merging the feature files selected by MRMD2.0, 96-dimensional features were obtained using MRMD2.0 for feature selection; the accuracy was 90.4762% using random forest with 10-fold cross-validation. MRMD2.0 assigns each feature a score (the higher the score, the stronger the discriminative ability of the feature). The features were arranged in descending order of score. The classifier was first used to evaluate the performance of the single highest-scoring feature; the feature with the second-highest score was then added and the new feature subset was evaluated, and this process was repeated until all features had been evaluated. Finally, a series of metrics for the feature subsets of different dimensions was obtained, including accuracy, F1, precision, recall, and ROC. According to these indicators, accuracy was correlated with dimension (Fig. 4a); the accuracy reached 87.3% when the dimension was 7. The classifier used internally by MRMD2.0 differs from the one used in this paper, so the metrics reported by MRMD2.0 serve only as a reference. When the seven features were separated out, the accuracy was 90.0794% using random forest; when only the three highest-scoring features were separated out and evaluated with random forest, the accuracy was 82.7381%.
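The incremental evaluation loop described above can be sketched as follows; `evaluate` stands in for 10-fold cross-validation with the chosen classifier, and all names and scores are illustrative:

```python
# Sketch of incremental feature selection: sort features by score, then
# evaluate cumulative subsets (top-1, top-2, ...) and keep the subset
# with the best accuracy.

def incremental_selection(scored_features, evaluate):
    """scored_features: list of (name, score). Returns (best_subset, best_acc)."""
    ranked = [name for name, _ in
              sorted(scored_features, key=lambda p: p[1], reverse=True)]
    best_subset, best_acc = [], -1.0
    subset = []
    for name in ranked:
        subset.append(name)
        acc = evaluate(subset)
        if acc > best_acc:
            best_subset, best_acc = list(subset), acc
    return best_subset, best_acc

# Toy evaluator: accuracy rises with the two informative features ("f1",
# "f2") and falls slightly with every extra feature added.
def toy_eval(subset):
    return 0.6 + 0.15 * len({"f1", "f2"} & set(subset)) - 0.01 * len(subset)

subset, acc = incremental_selection(
    [("f1", 0.9), ("f3", 0.5), ("f2", 0.8)], toy_eval)
print(subset)  # ['f1', 'f2']
```

The loop finds the smallest high-accuracy subset without exhaustively searching all feature combinations, which is exactly the trade-off exploited in Fig. 4.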
After merging the feature files selected by MRMD1.0, 347-dimensional features were obtained; the accuracy was 90.0794% using random forest with 10-fold cross-validation. Using the same method (Fig. 4b), the accuracy reached 84.92063% when the dimension was 5. When the five-dimensional features were separated out, the accuracy was 85.7143% using random forest; for the three highest-scoring features, the accuracy was 83.5317%.
Because the dimension after double feature selection by MRMD1.0 remained high, the 347-dimensional feature file was further reduced by MRMD2.0 to yield 301-dimensional features; the accuracy was 90.873% using random forest with 10-fold cross-validation. Using the same method (Fig. 4c), the accuracy was 83.33% when the dimension was 3. When the three-dimensional feature set was separated out and evaluated with random forest, the accuracy was 86.1111%. The three-dimensional feature set was then evaluated with the support vector machine, using a grid search to tune the parameters c and g; with c = 32.0 and g = 3.0517578125e-05, the accuracy was 88.8889%. Since the classification accuracy of the support vector machine was better than that of random forest, the support vector machine was finally used as the classifier to build the model. Finally, we randomly separated 80% of the data set as the training set and the remaining 20% as the test set, and used the support vector machine without the grid search to predict; the accuracy was 90.099%.
The three highest-scoring features were as follows:
1. Xc1.C is the frequency of occurrence of cysteine obtained using the pseudo-amino acid composition feature extraction strategy.
2. postivecharger.postivecharger.gap2 first divides the amino acids into five groups by grouped amino acid composition and then calculates the interactions between the positively charged group and itself by the composition of k-spaced amino acid group pairs.
3. hydrophobicity_ENGD860101.1.residue50 refers to the position in the protein chain at which 50% of the polar amino acids have occurred, expressed as a percentage of the total number of amino acids in the chain.
The specific performance indicators of the 3-dimensional model based on random forest and support vector machine are shown in Table 2, together with the results of other methods published on the same benchmark data set. The results show that, for the three-dimensional model obtained in this paper, the predictions of the support vector machine were better than those of random forest. The prediction method proposed in this paper can also complement existing methods for predicting the secreted proteins of malaria protozoa.
The feature dimension of the model used in this paper is lower than those of other published models, which avoids the curse of dimensionality caused by high-dimensional feature vectors [79].

IV. CONCLUSION
Molecular characterization of diseases has proved valuable for treating diseases [80], [81]. The identification of secreted proteins from malaria protozoa helps to identify targets for anti-malarial drugs and to design anti-malarial drugs. In this paper, good results were obtained using the three-dimensional features postivecharger.postivecharger.gap2, Xc1.C, and hydrophobicity_ENGD860101.1.residue50 to predict secretory proteins. The method proposed in this paper can potentially be applied to other protein classification and biological problems [82], [83]. In the future, we will construct a more robust model to improve predictions by using ensemble classification algorithms [50], [84]-[88] and deep learning [89]-[91].