A Dynamic-Time Distance Based on Wavelet Decomposition for Subcellular Localization Classification

The use of bioinformatics to predict protein subcellular localization is a significant method to study protein function. In the present study, a novel dynamic-time distance measurement based on wavelet packet decomposition (WPD) was proposed for protein subcellular localization prediction. The protein sequence was firstly converted into multiple numerical signals according to physical/chemical properties. Following signal decomposition into multiple subsignals by wavelet packet decomposition, a comprehensive dynamic-time distance was obtained by the dynamic time warping (DTW) algorithm. Finally, the expected classification can be produced based on the new distance measurement. By introducing DTW, the present algorithm can overcome the shortcoming that traditional methodologies can not measure the similarity of unequal-length sequences. Additional, multi frequency band subsignals can retain more information to achieve accurate results. It was suggested that our algorithm provides superior classification recognition by setting traditional methods as a comparative experiment in the E.coli dataset. It can distinguish cytoplasm and cell membrane proteins that are particularly difficult to be identified by the traditional methodologies.


I. INTRODUCTION
The development of genome sequencing has enabled the determination of a large number of protein and DNA sequence information. Currently biological researchers face the dilemma of the study method and use of the large amount of data generated by sequencing methods [1]. The computer-based approaches exert specific advantages while assimilating large amount of biological data, such as the high computational speed and low cost of experimentation. This plays an important role in the processing of biological information. The development of novel algorithms for biological sequences has become an important research interest.
Proteins are considered the material basis for living forms and constitute the basic organic matter of the cells. Subcellular location relates closely to the functioning of a protein, which is one of the key attributes of gene production. It is well known that the subcellular localization of proteins plays a crucial role in many aspects of their function. This involves the three essential features of the protein namely, biological The associate editor coordinating the review of this manuscript and approving it for publication was Navanietha Krishnaraj Rathinam. objective, biochemical activity and cellular localization that includes active gene transcription [2]. However, each protein can only exert its specific function in a particular subcellular position. The localization of these proteins provides a gateway through which functional analysis may be explored. It is necessary to develop appropriate methods for the effective prediction of the subcellular localization of a protein.
Several studies have explored the application of bioinformatic methodologies in the identification of the subcellular localization of proteins. Nakashima and Nishikawa proposed the amino acid composition for the identification of intracellular and extracellular proteins in 1994 [3]. Bulashevska and Eils introduced a recursive algorithm namely, HensBC in 2006, which constructed a hierarchical ensemble of classifiers and improved the accuracy of the prediction of certain classes with few sequences [4]. Li et al. proposed an ensemble classifier of KNN and SVM algorithms in order to predict the subcellular localization of eukaryotic proteins based on a voting system in 2012 [5]. Mandal et al. utilized the multiobjective particle swarm optimization-based feature selection technique in order to predict the subcellular localization of proteins [6]. To date, various computational techniques such as the hidden Markov model [7], the neural network [8], the K-Nearest Neighbor [9], [10] and the Support Vector Machine have been developed for the identification of the subcellular protein localization prediction [11]- [13].
The aforementioned studies have shown that protein subcellular localization is closely related to a variety of factors, while the different length of protein sequences further renders the similarity analysis more difficult, notably by the use of the traditional similarity distance method. Dynamic Time Warping (DTW) is a distance algorithm [14], which was originally proposed by the Japanese scholar Itakura in the 1960s. DTW exhibits apparent advantages in the process of signals, especially for the sequences of lack fragments and/or dimension shift. Wavelet packet decomposition (WPD) is also termed wavelet packet or subband tree. It is a very sophisticated signal analysis method, which can provide specific signal decomposition [15]. WPD has a wide range of applications in a majority of scientific fields, such as seismic wave observation and oil and gas exploration.
In the present study, according to the characteristics of protein sequences, a novel algorithm for the identification of the subcellular protein localization based on Wavelet Packet Decomposition and Dynamic Time Warping is proposed. To obtain the comprehensive feature information in the sequence, different physical/chemical properties were utilized in the algorithm to get numerical signals; By WPD, the sequence with various information were decomposed into subsignals in different frequency bands, so as to reduces the error from the comparison between different types of sequences; DTW achieves warping matching between unequal length sequences by strenching/shortening sequence signals, which can more accurately calculate the similarity between sequences. The algorithm can effectively classify the protein sequences of the cytoplasmic and the cell membrane proteins compared with the traditional methodologies.

II. MATERIALS AND METHODS
The dataset used in this article was obtained from the UniProtKB/Swiss-Prot module in the UniProt database, which is an expert-proven, high-quality, manual annotation, non-redundant dataset consisting mainly of three modules: UniProtKB/Swiss-Prot, UniProtKB/TrEMBL and UniParc. According to the different subcellular localization, the E.coli dataset was obtained. The specific information of the data set is shown in TABLE 1.

A. PROTEIN SEQUENCE NUMERIZATION
The process of numerization the eigenvector from sequence is directly related to the final classification results, since the original protein sequence is represented by the letter symbol. In the present study, we provide six numerical methods from different aspects for the execution of the classification algorithm. The method includes the contribution of the physical and chemical properties of the proteins. The descriptions of the numerical methods are as follows: Method-AAC: Method-AAC indicates amino acid composition, which may be considered the easiest of all approaches. The method enables the construction of the eigenvector based on the ratio of 20 amino acids of the entire sequence, where the first component is the percentage of alanine (A) in the entire sequence and the second component is the percentage of arginine (R).
Method-cwt: Method-cwt creates a wavelet feature vector that is a method of digitizing data by using a continuous wavelet transformation. This method provides the advantage of characterizing the local characteristics of a signal based on the parameters time and frequency. By selecting the option low frequency, the method confers high resolution within the time domain and low resolution within the frequency domain. By selecting high frequency, the resolution is the reverse. It automatically adapts to time-frequency signal analysis capabilities, making it possible to focus on any detail of the signal. Method-cwt uses the continuous wavelet transformations in order to translate the original signal to a requisite dimension eigenvector.
Method-Hy: Method-Hy digitizes the amino acid sequence according to its hydrophobicity. Unlike method-AAC, the dimension of the eigenvector constructed by method-Hy is not fixed and it varies according to the length of the amino acid sequence. For example, if the specific numerical process is the amino acid X in the sequence, then X is replaced by X's hydrophobic value x. The hydrophobic values of the 20 amino acids are shown in TABLE 2.
Method-MW: Method-MW digitizes the amino acid sequence according to its relative molecular mass. In the chemical formula, the sum of the relative atomic mass of each atom is the relative molecular mass. In a similar mode as method-Hy, the characteristic vector created by method-MW also varies with the length of the amino acid sequence. The relative molecular mass of the 20 amino acids is shown in TABLE 2.
Method-Rf: The concepts of polarity and nonpolarity are often associated with hydrophobicity. According to the principle of similar compatibility, the polar solute is dissolved in the polar solvent, while the nonpolar solute is dissolved in the nonpolar solvent. Water is often used as a solvent, which is polar. Consequently, the polar amino acids can be regarded as water-soluble amino acids, whereas the nonpolar amino acids are not readily dissolved in water. Rf is the mobility of amino acids, depending on their polarity and hydrophobicity. The greater the polarity, the smaller the mobility, whereas the greater hydrophobicity gives rise to larger mobility. Method-Rf digitizes the amino acid sequence so that each amino acid is replaced by its Rf value. The standard Rf values of the 20 amino acids are shown in TABLE 2.
Method-pI: The isoelectric point (pI), is the pH at which the separated positive and negative ions of the amino acid and/or protein carry no net electrical charge in the statistical mean. The isoelectric points of the 20 amino acids are shown in TABLE 2.
It is generally accepted that the use of a single physical quantity cannot effectively reflect the overall physical and chemical properties of the protein sequence. Moreover, the replacement of the amino acids with their corresponding values solely on the basis of the protein sequence, frequently omits the geometrical structure in space. Following the representation of one protein with certain spatial structure by a feature vector of one dimension, significant structural information will inevitably be lost, which will in turn affect the subsequent classification. To overcome this disadvantage, the Wavelet Packet Decomposition (WPD) and Dynamic Time Warping (DTW) were introduced based on the six numerical methods. WPD can decompose and reconstruct the signals and divide them into a smaller dimension subsignals in a specific manner. Thus it can achieve the distinction between different types of signals, and additionally reduce the misclassification. DTW is further capable of dynamically matching two sequence samples. Thus, when the matching sequences exhibit different time dimension, the best match between the samples can be selected by non-linear warping, this improves the classification accuracy. Right after, the main concept of the study is presented namely, a dynamic-time distance method based on wavelet packet decomposition.

B. WAVELET PACKET DECOMPOSITION
Wavelet packet decomposition (WPD) is a very sophisticated signal analysis method, which can divide the bands into multiple levels, and select the corresponding frequency band in order to match the signal spectrum and ultimately improve the time-frequency resolution. The excellent time-frequency characteristics of the WPD can improve the wavelet transformation. On the basis of wavelet decomposition, the discrete detail is decomposed in order to yield the binary tree decomposition of the signal. The incomplete WPD can continue to decompose or terminate the decomposition on any node. The signal band is then divided according to the requirements of the actual signal analysis. Equation (1) is the function of the orthogonal wavelet packet decomposition.
In (1), is a pair of orthogonal mirror filters. When n=0, u 0 (t) and u 1 (t) are the scale function ϕ(t) and the wavelet function ψ(t).
The WPD diagram is shown in Fig. 1 and the three-tier decomposition is used as an example. Specifically, WPD was used in order to convert the original signal S. The signal was divided into two types of signals: low frequency and high frequency respectively, denoted as A 1 and D 1 . Subsequently, the two signals were decomposed again in order to yield low frequency of A 1 , high frequency of A 1 , low frequency of D 1 and high frequency of D 1 . The aforementioned parameters were denoted as AA 2 , DA 2 , AD 2 and DD 2 respectively. Similarly, WPD was used in order to additionally convert the four signals. As a result eight signals of high and low frequency were obtained, which were denoted as AAA 3 , DAA 3 , ADA 3 , DDA 3 , AAD 3 , DAD 3 , ADD 3 and DDD 3 . In the present study, the wavelet packet decomposition was used that is based on the haar wavelet. This indicates that the signal dimension can be strictly divided into one-half. Consequently, the 2k signals of the kth layer exhibit the 1/2 k dimension of the original signal S. For example, an original signal S with dimension N=800, following 3 times of decomposition, is further decomposed into 2 3 =8 signal fragments, and the dimension of each signal fragment is reduced to 100.
Following protein sequence numerization, the digital signal was decomposed into a number of smaller signal segments according to a particular rule by using WPD. Consequently, the different information in a sequence was decomposed into specific fragments, which is conducive in order to explore the new distance measure, and finally enhance the recognition accuracy.

C. DYNAMIC TIME WARPING
Equation (2) gives the mathematical formula of the Euclidean distance.
It should be noted that X and Y both are feature vectors whose dimension is n in the aforementioned equation. This indicates that the Euclidean distance can be calculated based on the precondition that the dimension of the two vectors is the same. But for unequal length signals, it is not feasible to use the traditional Euclidean distance.
In this aspect, the DTW exhibits certain advantages. Dynamic Time Warping (DTW) is a distance algorithm, which is suitable for unequal length signal. The unknown sequence is stretched and/or shortened until it coincides with the length of the reference template. In this process, the time axis of the unknown sample is distorted or bent so that its feature corresponds to the standard pattern. A total of two vectors with any dimension can identify the dynamic-time distance by DTW. This results in the distortion of the two samples to the standard model in the time domain, and the enhancement of the classification accuracy. Fig. 2 indicates the general process of DTW. Initially, the corresponding points of the two sequences and the warp sequence were identified using the standard model and subsequently the regular distance can be estimated according to the corresponding points.
In (3), i=1,2,. . . ,n; j=1,2,. . . ,m. D(X,Y) represents the distance of the vector X and Y . W is a regular path. The regular path is a set containing multiple path points of pairwise w k =(i,j), and always originates with (1,1) and ends with (n,m). A total number of L paths was selected for the process of the dynamic-time distance calculation of X and Y . D(X,Y) is the optimal path, so as to indicate the dynamic-time distance of the shortest path. Different paths exhibited different number of path points (K ). The path point w k =(i,j) indicates that the distance between the component x i of X vector and the component y j of Y vector, and their average value calculated into the cumulative distance. Generally, the dynamictime distance for a vector of smaller dimensions is always smaller due to the fact that the path points on the optimal path are always smaller. Therefore, the cumulative distance was divided by K in order to calculate the average distance as D(X,Y).
Another way to explain the process of DTW is shown below. For the same parameters namely, X = (x 1 ,x 2 ,. . . ,x n ), Y = (y 1 ,y 2 ,. . . ,y m ), a distance matrix D of n×m was constructed initially, where d(i,j) in the matrix represents an estimate of the square of the distance between x i and y j . The cumulative distance matrix was denoted as D and D(n,m)/K represented the dynamic-time distance of X and Y . The conditions for the construction of the cumulative matrix D were as follows:

. ,m, D(i,j) may be d(i,j) plus any one of D(i−1,j), D(i,j−1), D(i−1,j−1). The smallest value is also calculated by its left, top and/or upper left elements. The path from
The signals analyzed in the present study are biological signals, and their source organisms are usually mutated. Biological mutations frequently lead to biological signal changes. Specific examples include omission of certain amino acid molecules, chromosome fragments exchange and/or protein sequence length changes. Therefore, the use of DTW in the biological signal processing exhibits specific advantages compared with the traditional distance algorithm.

D. THE PREDICTION ALGORITHMS
The key point of the proposed method is the combination of WPD and DTW. Firstly, the numerical signals of the test samples and the training set are divided into 16-segment subsignals from low to high by four-layer WPD. This results in preprocessing of the information. Subsequently, the DTW compares the test sample with the 16 subsignals of the training sample in order to obtain 16 distances. The sum of the 16 distances is further compared as the dynamic-time distance between them. Finally, the k-nearest neighbor method is used to determine the partition of the test sample.
In the present study, we used six numerical methods to yield eight different algorithms and compare their performance. The specific algorithms are shown in Fig. 3: The second row in Fig. 3 represents the six aforementioned numerical methods. For any protein sequence, the first four numerical methods create features with unfixable dimension, which change according to the length of the protein sequence. The latter two numerical methods are used as contrast in the experiments to create 20-dimensional features. The third row indicates the selection of WPD based on the previous eigenvectors. In the present study, we used the four-layer wavelet packet decomposition. The fourth row indicates the application of the DTW algorithm based on WPD.
The fifth row demonstrates the classification algorithm that was used in the present study. KNN and SVM are well-known supervised classifiers, they are widely-used for their good performance, easy to understand and implement. Considering that our method can estimate the distance between test samples and training samples, KNN and SVM are suitable for the prediction task.
Therefore, the final classification algorithm adopts the k-nearest neighbor method, and classifies the test sample according to the largest number of training samples. For wavelet features and amino acid composition features, the support vector machine (LIBSVM) was used in order to perform the classification. LIBSVM exhibits great impact and high classification effect. In addition, 10-cross validation was used that ensured division of the dataset into ten subsets. Each routine sample test was conducted as a test set and the remaining sets were used for the training set in order to calculate the final result. The average of ten results was used as the final classification accuracy. The sixth row indicates the abbreviation of the different methods.
The entire process of the prediction is shown in Fig. 4. It should be noted that all the steps in Fig.4 are required for KNN method, but the steps in dashed box are not need for SVM method.
The mathematical basis behind the KNN-Hy method includes numerical method, wavelet analysis, similarity distance and statistic-based learning. The dimension of the eigenvector constructed by hydrophobicity varies with the length of the protein sequences, DTW is a distance algorithm which is suitable for unequal length sequences. Wavelet analysis has been viewed as a ''mathematical microscope'' so that subsignals from WPD contain different structure information of protein sequences. KNN is a machine learning algorithms
In DTW_matrix, the jth row-vector indicates the distances between the jth test sample and all training samples. We can use the function knn_min to get the prediction result of the jth test sample according to KNN algorithm.
Where, wpd16 is a function to decompose a signal into 16 segments. dtwd is a function to get the DTW distance between WP1 and WP2, which is the sum of one-to-one distances between 16 segments in pairs.

III. EXPERIMENTS AND ANALYSIS
The data sets used in this study are based on a different subcellular localization that is downloaded from UniProt. The eight algorithms were respectively used in the experiment with the E.coli dataset, and the result is shown in Fig. 5.
The results of wavelet and amino acid composition on the E.coli dataset can be estimated accurately based on the new method proposed in the current study. The optimal classification is KNN-Hy and the accuracy rate can increase up to 83.76%. The lowest accuracy rate was noted for SVM-cwt and was estimated to 62.95%. The new method was applied on wavelet features and the amino acid composition features, and the classification accuracy was considerably improved. The wavelet features were increased from 62.95% to 75.84%, while the amino acid compositions from 64.67% to 76.26%. The use of WPD in combination with DTW improved the subcellular localization classification are shown in Fig. 6.
The use of different standards in the numerical process was employed in order to examine the numerical difference of the various physical quantities. The first four numerical methods were compared and are shown in Fig. 7.
KNN-Hy, which utilizes the hydrophobicity of amino acids, can achieve the optimal performances among the four algorithms in the overall classification and the classification accuracy was estimated to 83.76% (Fig. 7). However, KNN-pI, which uses the isoelectric points of the amino acids, can achieve the optimal performance among the cell fluids, membranes and quasi-nuclear subcellular localizations.  The classification accuracy rates were 97.22%, 98.98% and 97.41% respectively, with the exception of the cytoplasmic locations that were estimated to 3.95%. Consequently, the overall classification of KNN-pI is the lowest and is estimated to 66.19%. Specific experiments for the detection of the localization of cytoplasmic proteins were designed in order to further analyze the performance of KNN-pI.
The incorrect classification of the protein sequence is shown in blue, and the correct classification in orange. The majority of the cytoplasm sequences are incorrectly classified as membrane sequences by the KNN-pI algorithm (Fig. 8). Consequently, the use of isoelectric point numericalized protein sequences cannot effectively distinguish between cytoplasmic and cell membrane protein sequences, and this eventually results in lower classification accuracy. It is worth mentioning that the classification accuracy of the cytoplasm is 0% as determined by the SVM-AAC and SVM-cwt methods, and more than 98% are partitioned to the membrane sequence. Although the majority of the membrane can be assigned to the right partition, this does not ensure that the algorithm can identify the cell membrane sequence. In contrast to this hypothesis, KNN-Hy, KNN-MW, KNN-Rf, KNN-cwt and KNN-AAC were effective in distinguishing between the cytoplasmic and cell membrane regions. With regard to the KNN-pI method, the classification results of the cytoplasm and the cell membrane using the KNN-Hy method are shown in Fig. 9.   Furthermore we analyzed the dynamic-time distance using the KNN-Hy algorithm. By calculating the dynamic-time distance of each sample in the dataset, we estimated the distance, and constructed the histogram of the distance (Fig. 10). The maximum density of the distance was between 20 and 25.
Its frequency was estimated by the multiplication of the density with the width of the histogram. This was approximately 0.09 * 5=0.45. According to the four types of subcellular localization, the distances were classified into two categories of protein sequences with ten indicators. These were the following: four intra-class distance, and  six inter-class distance. In theory, the purpose of the present analysis was to design a classification algorithm that reduces the intra-class distance and increases the inter-class distance as much as possible. The intra-class distance of the membrane sequences was slightly larger compared to the other classes (Fig. 11).
This may be due to the other sequences that exert special class characteristics compared with the membrane sequence. Therefore, a part of the aforementioned algorithm exhibits poor performance in the distinction of the membrane sequences, as opposed to other categories.
The accuracy of all eight algorithms is provided in detail in TABLE 3.
It should be noted that the KNN-Hy algorithm exhibited optimal performance on the prediction of the subcellular localization of the E.coli dataset. The algorithm was capable of discriminating the cell membrane and cytoplasmic protein sequences that is considered a particularly difficult task.
The average classification accuracy by KNN and SVM are respectively shown in Fig.12 and Fig. 13.
As can be seen in Fig. 12 and Fig. 13, the SVM method can not distinguish cytoplasm from cell membrane, and the classification accuracy by KNN method is not high. Aiming at the sequences in cytoplasm and cell membrane, we further analyze their confusion matrix, as shown in TABLE 4: Without considering cell fluid and quasi nuclear, the classification accuracy of cytoplasm increases slightly, but there are still some overlap and confusion samples between cytoplasm and cell membrane. According to Fig. 11, their intra class distance and inter class distances are very close, which may result in the unclear boundaries between the two types of data. In addition, the large sample size also increases the difficulty of identification.

IV. CONCLUSION
In the present study, a dynamic-time distance based on WPD is proposed for the prediction of subcellular protein sequence localization. The algorithm exhibits certain advantages in the processing of biological sequences, which include the following points: Firstly, the algorithm can compare two sequences with different length. Despite the same subcellular localization of the biological sequence, the difference in the sequence length may generate a great error with the traditional method, whereas the current algorithm can always provide reasonable distance measurement. Secondly, the idea of the comprehensive signal according to the signal frequency in the algorithm can further reduce the error in the processing, which is caused by digitizing the biological sequence with a single physical quantity. Thirdly, this algorithm can effectively reduce the errors caused by dynamic matching due to the absence of amino acid molecules, the exchange of chromosome fragments, and/or the change in the protein sequence length. With regard to the subcellular localization classification of the Escherichia coli dataset, the traditional method cannot distinguish between the cytoplasm and the cell membrane effectively, although the classification accuracy rate of the present algorithm is approximately 83.76%. Theoretically, the classification algorithm proposed in the present study can be applied to other biological sequence classification and/or other signal processing problems.
Moreover, a potential way to perfect the method is to select an optimal numerical method. In the current analysis, a numerical method, which can effectively reflect various physical and chemical properties of biological sequences, can improve the classification performance significantly. In our experiments, the proteins with classification label from UniProtKB are used to verify the performance of the algorithm. In theory, as a similarity measurement, dynamictime distance based on WPD can also be introduced into some unsupervised clustering algorithms based on distance [16], [17], which is a very promising method to achieve the subcellular localization prediction for unknown proteins in the future.