The Classification of Enzymes by Deep Learning



I. INTRODUCTION
As a very important type of biocatalyst, enzymes play a vital role in maintaining the life activities of organisms. They dominate metabolism, nutrition, energy conversion and many other chemical reactions closely related to the life process. Most enzymes are proteins, while a few are ribonucleic acids. To date, research on the prediction of enzyme classes and subclasses has focused on enzymes whose chemical essence is protein; that is, when we use computational models to classify enzymes, the feature extraction methods we adopt are designed for proteins. To facilitate the further study of enzymes, the International Union of Biochemistry (IUB) established an international commission, the Enzyme Commission (EC), in charge of developing a nomenclature for enzymes. The Commission has classified enzymes into 7 main classes. An EC number is composed of four figures that identify the main class, subclass, sub-subclass and substrate class of the enzyme. Most classifiers designed by scholars can classify enzymes to the level of subclass [1]-[3]. To make the classes of enzymes easier to understand, they are visualized in Table 1. Since the category of translocases had not been proposed until recently, most prediction methods divide the enzymes into the other six categories [4], [5]. It is worth mentioning that in recent years, classifiers have appeared that can classify enzymes to the level of substrate class. (The associate editor coordinating the review of this manuscript and approving it for publication was Dariusz Mrozek.)
In the field of biology, wet-laboratory functional identification procedures were adopted to determine the function and category of enzymes, but such experiments are costly and time consuming [6]-[12]. Therefore, classifying enzymes using bioinformatic tools is attractive. With the development of bioinformatics and deep learning [13]-[28], scholars have designed many models for the prediction of enzyme classes [29]. In 2009, Nasibov et al. [30] adopted K-nearest neighbor (KNN) classification. In 2010, Qiu et al. [31] used a support vector machine (SVM), obtaining good results. Also in 2010, Concu et al. [32] used linear discriminant analysis (LDA) and artificial neural networks (ANN) and compared the final classification results. In addition, to achieve better prediction results, scholars usually combine various feature extraction methods and classification methods in their prediction process. For instance, Shen et al. [33] combined functional domain (FunD) and pseudo position-specific scoring matrix (PsePSSM) features in 2009. Wang et al. [34] combined composition, transition and distribution (CTD) and pseudo-amino acid composition (PseAAC) to extract features and classified sequences with a combination of random-k-label random forest (RAkEL-RF) and multi-label KNN (MLKNN) in 2014. In 2019, Ryu et al. [35] used DeepEC, consisting of three different convolutional neural network (CNN) structures, for enzyme classification. Generally, the prediction process can be roughly described as two steps: first extracting features from the sequences, and then classifying the feature set with a classification model. The specific prediction process is shown in Fig 1, along with feature extraction and classification methods often adopted by scholars. To facilitate research in this field, we summarize some recent machine learning methods, both novel and classic, utilized in predicting enzyme classes.
Our summary consists of four parts. The first part briefly introduces the protein sequence sets. In the second and third parts, we introduce methods for feature extraction and classification that are adopted in the latest papers or used frequently. Finally, we present a statistical analysis of and comments on recently published results.

II. PROTEIN SEQUENCE SET
Several databases contain large numbers of protein sequences, such as ENZYME (http://enzyme.expasy.org/), UniProt (http://www.uniprot.org/) and PDB (http://www.rcsb.org/). Scholars obtain sequences from these databases and use them to train and test their prediction models. To make the prediction process more rigorous, the data sets must be processed [3], [12], for example by deleting sequences with high similarity [36]. Specific tools or algorithms, such as PISCES or CD-HIT, can implement this step. Scholars use these processed data sets to build prediction models [37], [38]. We summarize in Table 2 the source and composition of the data sets adopted by scholars in the past few years.

III. FEATURE EXTRACTION
Before classification, we need to extract features from the protein sequences in the data set [39], [40]. Formulating the sequences as rational mathematical expressions quantifies various kinds of protein characteristics and is conducive to the classification in the next step.

A. PSEUDO AMINO ACID COMPOSITION (PseAAC)
This method evolved from amino acid composition (AAC), which counts the proportion of each amino acid in a sequence and generates a 20-dimensional vector. Since PseAAC was proposed by Chou in 2001, it has been widely adopted in bioinformatics [41]-[46]. By PseAAC, a sequence P of L residues R_1 R_2 ... R_L is formulated as a vector:

P = [x_1, \ldots, x_{20}, x_{20+1}, \ldots, x_{20+\lambda}]^T    (1)

The first 20 elements represent the composition of the 20 amino acids in the sequence, and the latter λ elements represent the sequence-order information. Each element of P is formulated as:

x_u = \frac{f_u}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j} \;(1 \le u \le 20), \qquad x_u = \frac{\omega\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j} \;(20+1 \le u \le 20+\lambda)    (2)

where f_i is the content of amino acid i in protein P and θ_j is the j-tier sequence-correlation factor calculated by (3):

\theta_j = \frac{1}{L-j} \sum_{i=1}^{L-j} \Theta(R_i, R_{i+j}), \quad j = 1, 2, \ldots, \lambda    (3)

Here ω and λ are user-defined parameters; λ reflects the maximal distance between one contiguous residue and another that is taken into account.
The correlation function Θ(R_i, R_j) is defined as:

\Theta(R_i, R_j) = \frac{1}{3}\{[F_1(R_j) - F_1(R_i)]^2 + [F_2(R_j) - F_2(R_i)]^2 + [F_3(R_j) - F_3(R_i)]^2\}    (4)

where F_1(R_i), F_2(R_i), and F_3(R_i) are the quantified values of physicochemical properties such as hydrophobicity, hydrophilicity and side-chain mass. These three values are standardized as:

F_k(R_i) = \frac{F_k^0(R_i) - \langle F_k^0 \rangle}{\mathrm{SD}(F_k^0)}, \quad k = 1, 2, 3    (5)

where F_1^0(R_i), F_2^0(R_i), and F_3^0(R_i) are the original values of the quantified physicochemical properties of the residue, which can be obtained from the literature, and \langle F_k^0 \rangle and \mathrm{SD}(F_k^0) are the mean and standard deviation of F_k^0 over the 20 native amino acids.
For PseAAC, the setting of the parameters ω and λ is critical: different values of these two parameters often lead to different final classification accuracy even under the same classification algorithm. In addition, more quantifiable indexes of protein character, such as F_4^0(R_i), F_5^0(R_i) and so on, can be added to the correlation function Θ(R_i, R_j) to obtain a vector more helpful for classification.
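As an illustration, the PseAAC construction can be sketched in Python. The property table and four-letter alphabet below are toy assumptions for brevity; a real implementation would use standardized values of several property scales for all 20 amino acids.

```python
# Toy normalized property table for a 4-letter alphabet -- illustrative
# values only, not real hydrophobicity data.
HYDRO = {"A": 0.62, "C": 0.29, "G": 0.48, "L": 1.06}

def theta_fn(r1, r2, props):
    # Correlation function Theta(Ri, Rj): mean squared difference of the
    # quantified physicochemical properties of the two residues.
    return sum((p[r1] - p[r2]) ** 2 for p in props) / len(props)

def pseaac(seq, props, lam=2, w=0.05):
    aas = sorted({a for p in props for a in p})
    L = len(seq)
    freqs = [seq.count(a) / L for a in aas]          # composition part
    thetas = [sum(theta_fn(seq[i], seq[i + j], props)
                  for i in range(L - j)) / (L - j)   # j-tier correlation
              for j in range(1, lam + 1)]            # sequence-order part
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

vec = pseaac("ACGLAG", [HYDRO], lam=2, w=0.05)
print(len(vec))  # 4 composition dims (toy alphabet) + 2 lag dims
```

By construction the vector elements sum to 1, which is a quick sanity check on any PseAAC implementation.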

B. PSEUDO POSITION SPECIFIC SCORING MATRIX (PsePSSM)
This method depends on the position-specific scoring matrix (PSSM), which represents the changes of amino acids at specific positions in a protein during the long process of evolution. A protein P of n residues can be described by the PSSM as:

P_{PSSM} = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,20} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1} & a_{n,2} & \cdots & a_{n,20} \end{bmatrix}    (6)

where the element a_{i,x} is the standardized score estimating the degree to which the ith amino acid of the sequence is converted into amino acid type x during evolution, n is the number of amino acids, and the 20 columns of P_{PSSM} correspond to the 20 native amino acids. This matrix is obtained by searching a comprehensive sequence database with PSI-BLAST [55], [56].
To turn the PSSM into a vector, each column can be represented by its average value:

\bar{P}_{PSSM} = [\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_{20}]^T    (7)

where

\bar{a}_j = \frac{1}{n} \sum_{i=1}^{n} a_{i,j}, \quad j = 1, 2, \ldots, 20    (8)

However, (7) cannot represent the sequence-order information. To express the amino acid order information in a sequence, the PsePSSM vector is proposed:

P_{PsePSSM} = [\bar{a}_1, \ldots, \bar{a}_{20}, \theta_1^k, \ldots, \theta_{20}^k]^T, \qquad \theta_j^k = \frac{1}{n-k} \sum_{i=1}^{n-k} (a_{i,j} - a_{i+k,j})^2    (9)

where k is a lag parameter that needs to be set. It is noteworthy that k must be smaller than N, the minimum number of amino acids among the sequences. When k = 0, the lag terms vanish and (9) degenerates into (7).
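A minimal sketch of the PsePSSM computation, assuming the PSSM is given as a list of per-residue score rows; the toy matrix below has only 3 columns instead of the real 20, purely to keep the example short.

```python
def psepssm(pssm, k=1):
    # pssm: one row of scores per residue position. Real PSSMs have 20
    # columns (one per amino acid type); this toy uses 3 for brevity.
    L, cols = len(pssm), len(pssm[0])
    mean = [sum(row[j] for row in pssm) / L for j in range(cols)]   # eq-(7)-style column means
    lagged = [sum((pssm[i][j] - pssm[i + g][j]) ** 2
                  for i in range(L - g)) / (L - g)                  # squared-difference lag terms
              for g in range(1, k + 1) for j in range(cols)]
    return mean + lagged

toy = [[1, 0, 2],
       [0, 1, 2],
       [1, 1, 2],
       [2, 0, 2]]
print(psepssm(toy, k=1))  # 3 means followed by 3 lag-1 terms
```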

C. TOPOLOGICAL INDICES
This method has recently been used to predict enzyme function. The protein sequence is represented as a star graph (SG), a representation designed by Milan et al. that is widely used in bioinformatics: the primary structure of the protein is expressed by a star graph, which we describe by its distance matrix and degree matrix. Particular topological indices calculated from the star graph characterize a protein sequence, and the software S2SNet can accomplish this process. The results file from S2SNet contains a series of indices, including the trace of the nth power of the connectivity matrix:

\mathrm{Tr}_n = \sum_i (M^n)_{ii}

where M is the graph connectivity matrix, n is the power of M, and ii denotes the ith diagonal element, and the Schultz topological index:

S = \sum_{i<j} (\deg_i + \deg_j)\, d_{ij}

where d_{ij} is an element of the distance matrix and deg_i an element of the degree matrix. Besides these two simple indices, S2SNet outputs an extra set of indicators; selecting different indices as features of a sequence and giving them different weights yields different feature vectors and, in turn, different prediction results.
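To make the indices concrete, here is a small pure-Python sketch that builds the connectivity matrix of a star graph and computes Tr_n and the Schultz index. The star-specific distance rule (1 to the center, 2 between leaves) replaces a general shortest-path computation, which is valid only for this graph shape.

```python
def star_adj(n):
    # Adjacency (connectivity) matrix of a star graph: node 0 is the
    # center, nodes 1..n-1 are the leaves.
    return [[1 if (i == 0) != (j == 0) else 0 for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def trace_powers(m, nmax):
    # Tr_n = trace(M^n) for n = 1..nmax
    out, p = [], m
    for _ in range(nmax):
        out.append(sum(p[i][i] for i in range(len(p))))
        p = matmul(p, m)
    return out

def schultz(adj):
    # Schultz index: sum over node pairs of (deg_i + deg_j) * d_ij.
    # In a star, d_ij = 1 if either node is the center, otherwise 2.
    n = len(adj)
    deg = [sum(row) for row in adj]
    dist = lambda i, j: 1 if 0 in (i, j) else 2
    return sum((deg[i] + deg[j]) * dist(i, j)
               for i in range(n) for j in range(i + 1, n))

m = star_adj(4)
print(trace_powers(m, 2), schultz(m))
```

For a 4-node star, Tr_1 = 0 (no self-loops) and Tr_2 equals the sum of the degrees, which is a handy correctness check.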

D. TORSION ANGLES
Information about the amide plane can be used as a structural feature. Only the single bonds formed by the α-carbon can rotate in the peptide chain, so they are the root cause of peptide chain curling and folding. Two angles, ϕ and ψ, called torsion angles, describe the rotation of the peptide planes about the α-carbon.
Since ϕ, ψ ∈ [−180°, 180°], we represent this information by the probability density of the torsion angles, computed from a 2D sample histogram with equally sized bins and smoothed with a 2D Gaussian kernel. A matrix of 19 × 19 bins covers the range of the angles ϕ and ψ, and the value of each bin represents the frequency of the corresponding torsion-angle pair. Finally, a feature vector describing the distribution of the angles is obtained.
We can also select a specific number of amino acid fragments in a protein and use a 19 × 19 bin matrix to represent the information of each amino acid; finally, the peptide-bond information of all amino acids is combined as the feature of the protein.
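The binning step can be sketched as follows. The Gaussian smoothing mentioned above is omitted for brevity, and the input is assumed to be a list of (ϕ, ψ) pairs in degrees.

```python
def torsion_features(angles, bins=19):
    # angles: list of (phi, psi) pairs in degrees, each in [-180, 180].
    # Builds the 19 x 19 sample histogram of relative frequencies;
    # the Gaussian smoothing step is intentionally left out.
    width = 360.0 / bins
    hist = [[0.0] * bins for _ in range(bins)]
    for phi, psi in angles:
        i = min(int((phi + 180) / width), bins - 1)  # clamp the +180 edge
        j = min(int((psi + 180) / width), bins - 1)
        hist[i][j] += 1.0 / len(angles)
    return [v for row in hist for v in row]          # flatten to a vector

feats = torsion_features([(-60, -45), (-60, -40), (120, 135)])
print(len(feats))  # 19 * 19 = 361-dimensional feature vector
```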

E. FUNCTION DOMAIN (FunD)
A protein can be divided into several fixed modules or regions, called functional domains, which can often be found in other proteins as well. Whether or not these fixed modules or regions appear in a protein can be regarded as a characteristic of the protein. Many databases, such as Pfam, HMMER, SMART, COG, KOG and CDD, can help us search a query protein.
First, a specific program such as RPS-BLAST is used to retrieve the functional domain information of the enzymes from the database. Second, the protein P can be formulated as:

P = [D_1, D_2, \ldots, D_n]^T    (13)

where n is the number of protein domains contained in the selected database and T is the transpose operator. Each element D_i is obtained by (14):

D_i = \begin{cases} 1, & \text{a hit to the } i\text{th domain is found in } P \\ 0, & \text{otherwise} \end{cases}    (14)
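The resulting domain indicator vector can be sketched as below; the domain accessions and the three-entry mini-database are hypothetical examples, not an actual Pfam slice.

```python
def fund_vector(found_domains, database_domains):
    # D_i = 1 if the i-th domain of the database occurs among the hits
    # reported for the protein (e.g. by RPS-BLAST), else D_i = 0.
    hits = set(found_domains)
    return [1 if d in hits else 0 for d in database_domains]

db = ["PF00001", "PF00069", "PF00075"]   # assumed mini-database
print(fund_vector(["PF00069"], db))      # -> [0, 1, 0]
```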

F. COMPOSITION, TRANSITION AND DISTRIBUTION (CTD)
With this method, a sequence is described from three different perspectives [57]. The first is the content of amino acids with a specific physicochemical property:

C(A) = \frac{N_A}{N}    (15)

where N is the total number of amino acids and N_A is the number of amino acids of the selected class A. The second perspective is the content of adjacent pairs of amino acids from two classes:

T(A, B) = \frac{N_{AB} + N_{BA}}{N - 1}    (16)

In (16), AB and BA represent the combination modes in which amino acids of classes A and B are next to each other. The third perspective is the distribution of the class-A amino acids along the sequence:

D(A) = \frac{N_i}{N}    (17)

which is the proportion of the sequence occupied by the first N_i residues needed to contain a given fraction (typically the first, 25%, 50%, 75% and 100%) of the type-A amino acids.
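The three descriptors can be sketched for a binary grouping of residues (class A holds residues with the chosen property, class B all others). The five-point distribution sampling follows a common CTD convention and is an assumption here, not taken verbatim from [57].

```python
def ctd(seq, group):
    # group: residues sharing one physicochemical property (class A);
    # every other residue is treated as class B.
    n = len(seq)
    labels = ["A" if c in group else "B" for c in seq]
    comp = labels.count("A") / n                        # composition
    trans = sum(1 for x, y in zip(labels, labels[1:])
                if x != y) / (n - 1)                    # transition
    pos = [i + 1 for i, l in enumerate(labels) if l == "A"]
    # distribution: relative position of the first, 25%, 50%, 75% and
    # last occurrence of class A along the chain
    dist = [pos[int(round(f * (len(pos) - 1)))] / n
            for f in (0, 0.25, 0.5, 0.75, 1)] if pos else [0.0] * 5
    return comp, trans, dist

c, t, d = ctd("AABBA", {"A"})
print(c, t, d)
```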

IV. CLASSIFICATION METHODS
With the development of enzyme function prediction, the classification process has become more and more complex. However, no matter how abstruse a classification algorithm is, it is essentially built from classic classification algorithms combined in different ways. Here, we briefly introduce some common classification methods.

A. LINEAR DISCRIMINANT ANALYSIS (LDA)
In a sense, a linear classifier performs a dimension-reduction operation. Specifically, the following transformation maps points of a multidimensional space onto a line:

y = \omega^T x    (18)

where x is a coordinate vector in the high-dimensional space and y is the result of projecting x onto a one-dimensional space. Then, y is compared with a preset threshold value to determine the category of x:

x \in \begin{cases} C_1, & y \ge y_0 \\ C_2, & y < y_0 \end{cases}    (19)

Here we take two categories as an example, where C_1 and C_2 are the categories and y_0 is the threshold value. Different values of ω in (18) lead to different results; we need to maximize the distance between classes and minimize the distance within classes to get the best effect. According to the Fisher linear discriminant, ω is obtained by:

\omega = S_\omega^{-1}(m_1 - m_2)    (20)

where m_1 − m_2 represents the distance between the class means and S_ω represents the total within-class scatter.
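A 2-D, two-class sketch of the Fisher construction: it solves S_w^{-1}(m_1 − m_2) with an explicit 2 × 2 inverse and places y_0 halfway between the projected class means, which is a common but not the only choice of threshold.

```python
def fisher_lda(c1, c2):
    # Two-class Fisher discriminant in 2-D: w = S_w^{-1} (m1 - m2),
    # with the threshold y0 halfway between the projected class means.
    mean = lambda pts: [sum(p[d] for p in pts) / len(pts) for d in (0, 1)]
    m1, m2 = mean(c1), mean(c2)
    s = [[0.0, 0.0], [0.0, 0.0]]                 # within-class scatter S_w
    for pts, m in ((c1, m1), (c2, m2)):
        for p in pts:
            dx, dy = p[0] - m[0], p[1] - m[1]
            s[0][0] += dx * dx; s[0][1] += dx * dy
            s[1][0] += dy * dx; s[1][1] += dy * dy
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    w = [(s[1][1] * diff[0] - s[0][1] * diff[1]) / det,   # explicit
         (s[0][0] * diff[1] - s[1][0] * diff[0]) / det]   # 2x2 inverse
    proj = lambda x: w[0] * x[0] + w[1] * x[1]
    y0 = (proj(m1) + proj(m2)) / 2
    side = 1 if proj(m1) > y0 else -1            # which side is class 1
    return lambda x: 1 if side * (proj(x) - y0) > 0 else 2

clf = fisher_lda([(0, 0), (1, 0), (0, 1)], [(4, 4), (5, 4), (4, 5)])
print(clf((0.5, 0.5)), clf((4.5, 4.5)))  # -> 1 2
```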

B. ARTIFICIAL NEURAL NETWORK (ANN)
An ANN, as a mathematical model simulating the way the human brain processes complex information, is very suitable as a classifier and is often used in the classification of enzymes and other bioinformatics problems [16], [58], [59]. A neural network is composed of many computing units called neurons; by adjusting the connections among a large number of neurons and synthesizing the results of all of them, information processing is achieved. Each neuron carries a threshold (θ), an activation function (F), and a set of weights (ω) for the input data. The output of a neuron is obtained by the following formula:

y = F\left(\sum_{i=1}^{n} \omega_i x_i - \theta\right)    (21)

where n is the number of inputs. Using the backpropagation algorithm, based on a strategy of gradient descent, the thresholds (θ) and weights (ω) of the neurons are modified repeatedly until the network has a satisfactory classification ability. With improvements in the technical applications of ANNs, many ANNs with specific structures have been designed for different situations, such as CNN and LNN.
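A single sigmoid neuron with a threshold, trained by gradient descent on squared error, gives a minimal sketch of the idea; a real network stacks many such units in layers, and the AND task below is just a toy target chosen because one neuron can learn it.

```python
import math

def neuron(x, w, theta):
    # Output of one neuron: activation F applied to the weighted input
    # sum minus the threshold (F here is the sigmoid).
    z = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, lr=0.5, epochs=3000):
    # Gradient descent on squared error for a single neuron -- a minimal
    # sketch of the backpropagation idea, not a full multilayer network.
    w, theta = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, t in samples:
            y = neuron(x, w, theta)
            g = (y - t) * y * (1 - y)                  # dE/dz
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            theta += lr * g                            # since z = w.x - theta
    return w, theta

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, th = train(AND)
print([round(neuron(x, w, th)) for x, _ in AND])
```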

C. K-Nearest-Neighbor (KNN)
This is a relatively simple algorithm, with uncomplicated and direct logic: a sample is assigned to the category of the samples closest to it. The algorithm consists of the following 4 steps: (1) calculate the distances from every point of known class to the current point; (2) sort these distances in increasing order; (3) take the k nearest samples around the point to be classified; (4) determine the category to which most of these k samples belong, and assign the sample to that category. The parameter k, the number of neighboring samples considered, is very important because it directly affects the classification: if its value is too small, noise in the samples has a great impact on the result; on the other hand, if k is too large, the classification error also grows.
There are also many improved algorithms based on KNN, such as ML-KNN and OET-KNN.
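The four steps can be sketched directly; the toy coordinates and the enzyme-class labels below are illustrative assumptions, not real feature vectors.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs.
    ranked = sorted(train, key=lambda s: math.dist(s[0], query))  # steps 1-2
    votes = Counter(label for _, label in ranked[:k])             # step 3
    return votes.most_common(1)[0][0]                             # step 4

data = [((0, 0), "hydrolase"), ((0, 1), "hydrolase"), ((1, 0), "hydrolase"),
        ((5, 5), "ligase"), ((5, 6), "ligase"), ((6, 5), "ligase")]
print(knn_predict(data, (1, 1)), knn_predict(data, (5.5, 5.5)))
```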

D. SUPPORT VECTOR MACHINE (SVM)
As a classic two-class model, SVM is widely used in bioinformatics [60]-[78]. The basic idea of SVM is to find a hyperplane that separates the two kinds of samples while maximizing the margin, that is, the sum of the distances from the hyperplane to the nearest samples of each class.
The classification process can be simply described by

f(x_i) = \mathrm{sign}(\omega^T x_i + b)

where x_i is the feature vector of sample i, whose class is determined by the sign of ω^T x_i + b. Maximizing the margin leads to the optimization problem:

\min_{\omega, b} \frac{1}{2}\|\omega\|^2 \quad \text{s.t.} \quad y_i(\omega^T x_i + b) \ge 1, \; i = 1, \ldots, m    (22)

However, it is difficult to solve (22) directly. SVM is divided into hard-margin and soft-margin variants, because in practice the samples cannot always be separated completely. As a result, one usually adopts the soft-margin strategy and solves the dual problem, whose solution coincides with that of (22) in the separable case:

\max_{\alpha} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \; \sum_{i=1}^{m} \alpha_i y_i = 0    (23)

where the α are trained by maximizing this Lagrangian expression and k represents the kernel function. Finally, the decision function is obtained:

f(x) = \mathrm{sign}\left(\sum_{i=1}^{m} \alpha_i y_i k(x_i, x) + b\right)    (24)

Two parameters, the penalty C and the kernel parameter g, need to be set before classification; they can be tuned with software such as LibSVM.
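Since solving the kernelized dual (as LibSVM does) is beyond a short example, the sketch below swaps in sub-gradient descent on the primal soft-margin objective with a linear kernel; it illustrates the margin idea, not the dual machinery described above.

```python
def train_linear_svm(samples, c=1.0, lr=0.01, epochs=500):
    # Sub-gradient descent on the primal soft-margin objective
    #   ||w||^2 / 2 + C * sum_i max(0, 1 - y_i (w.x_i + b)),
    # a simpler linear stand-in for the dual/kernel formulation.
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in samples:                           # y in {-1, +1}
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                             # hinge loss active
                w = [wi - lr * (wi - c * y * xi) for wi, xi in zip(w, x)]
                b += lr * c * y
            else:                                      # only regularizer
                w = [wi * (1 - lr) for wi in w]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

data = [((3, 3), 1), ((4, 3), 1), ((3, 4), 1),
        ((0, 0), -1), ((1, 0), -1), ((0, 1), -1)]
w, b = train_linear_svm(data)
print(svm_predict(w, b, (3.5, 3.5)), svm_predict(w, b, (0.5, 0.5)))
```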

V. EVALUATION AND COMMENT
In addition to the above aspects, the measures used to evaluate the prediction result and the performance of a predictor are worth mentioning. A statistical analysis of and comments on recently published results follow.

A. EVALUATION MEASURE
It is often not objective to measure a prediction only by its accuracy. Metrics such as the following are usually adopted to evaluate prediction quality [4], [11], [79]-[91]:

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}    (25)

where TP, FP and FN are, respectively, positive samples determined as positive, negative samples determined as positive, and positive samples determined as negative. In addition, indicators such as the ROC curve and AUC are often used to evaluate prediction results. For testing, there are three strategies: the independent test, the n-fold cross-validation test, and the jackknife test [92]-[103]. Of these, the jackknife test is regarded as the most rigorous and is widely used. Its process is to (1) divide the original data set into parts, (2) select one part as the testing set and the rest as the training set, and (3) integrate the results over all testing sets; in the jackknife (leave-one-out) case, each part contains a single sample. As shown in (26):

X = x_1 \cup x_2 \cup \cdots \cup x_n, \qquad x_i \cap x_j = \varnothing \;(i \ne j)    (26)

where the data set X is divided into n parts and the complement of x_i serves as the ith training set.
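The counting metrics and the leave-one-out splitting can be sketched as follows; F1 is included as one common way to combine precision and recall.

```python
def metrics(tp, fp, fn, tn):
    # tp/fp/fn/tn: the four outcome counts defined in the text.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # also called sensitivity
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

def jackknife_splits(data):
    # Leave-one-out splitting: every sample serves as the test set once.
    for i in range(len(data)):
        yield data[:i] + data[i + 1:], data[i]

p, r, a, f = metrics(tp=8, fp=2, fn=1, tn=9)
print(round(p, 3), round(r, 3), round(a, 3), round(f, 3))
print(len(list(jackknife_splits([1, 2, 3, 4]))))  # one split per sample
```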

B. COMMENT ON PUBLISHED RESULTS
Here, we select the papers listed in Table 2 and evaluate them in detail.
In 2007, Shen et al. [33] adopted OET-KNN to predict enzyme function. First, they used FunD and PseAAC to extract features, obtaining the sets S1 and S2. Second, they used OET-KNN to classify S1 and S2. Finally, they fused the classification results from S1 and S2 to obtain the final prediction. The results showed that the ability to predict enzymes in the subclasses of oxidoreductases is somewhat poor, with a jackknife success rate of 86.7%. In 2014, Wang et al. [34] adopted several feature extraction and classification methods, pairing each feature extraction method with a classification method, and compared the results of four prediction models. They found that the best model is the combination of RAkEL-RF and CTD, for which the accuracy reached 97.99% with 10-fold cross-validation on the training data and 97.57% on the test data. Also in 2014, Zou et al. [47] used three methods to extract features and classify them and compared the classification results; the best jackknife accuracy was 91.64%. It is worth noting that the classifier designed by Zou et al. can classify multi-label enzymes. In 2019, Zou et al. [54] developed the prediction model MIDEEPre, which combines three feature extraction methods and can also classify multi-label enzymes.
There are also some simple prediction models with good classification results. In 2019, Concu et al. [32] separately used LDA and ANN for classification with only four kinds of characteristics plus the topological indices calculated with the software S2SNet. Both achieved high accuracy (98.73% for LDA and 100% for ANN), and the models were validated using cross-validation. Their prediction model is also one of the few designed to divide enzymes into the seven main classes. In another paper [51], the authors used more topological indices in a predictor that can classify enzymes to the level of subclass and obtained a best overall accuracy of 91.2%. Not only can the order information of amino acids or the information about amino acid residues in a protein sequence serve as features, but the 3D structure of the protein can also be exploited. Amidi et al. [48] designed a coordinate system to represent the 3D structure of proteins and used a CNN to classify a test set, achieving an overall accuracy of 77.6%. The accuracy was relatively low, perhaps because only the 3D structure of the protein was adopted as the feature. The two torsion angles ϕ and ψ of the polypeptide chain are also an embodiment of the protein 3D structure. In 2019, Amidi et al. [49] extracted two kinds of features, the torsion angles and a similarity quantification. They fused the features with two strategies, at the feature level and at the decision level, and classified with SVM and NN. The best result, an overall accuracy of 85.4%, was achieved by decision-level fusion combining the SVM and NN classifiers; their predictor can also classify multi-label enzymes. In the same year, Gao et al. [50] extracted features using the PSSM and the 3D structure, used three CNNs to process three feature maps, and finally classified with KNN. Representing the protein feature information by feature maps is a novel approach in this prediction model.
The best overall accuracy was 92.34%. Dalkiran et al. [52] combined three prediction models: SPMap, Blast-KNN, and Pepstats. They gave different weights to the prediction results of each model and then combined them into the final prediction, with a precision of 99%. The strength of this prediction model lies in its ability to classify enzymes to the level of substrate class. There are also predictors that address the classification of multifunctional enzymes; in 2016, Che et al. [53] adopted the ACC method for this purpose. Table 3 shows the information on the prediction models mentioned in this paper more intuitively. In this table, if the accuracy or precision is not clearly given in the original paper, we use ''none'' to represent it.

VI. CONCLUSION
For the classification of enzymes, the functions of different prediction models are not identical. For example, some models support the classification of multi-label enzymes, and some models can classify an enzyme to the level of subclass. Therefore, it is not meaningful to directly compare the accuracy of one model with another.
We find that in the prediction models described in these recent papers, ANN algorithms are used frequently. This implies that, as research progresses and prediction models become more and more powerful, many traditional classification algorithms such as SVM and RF cannot meet the increasingly complex classification requirements. Since a classification algorithm is designed on a fixed mathematical principle, its performance is essentially stable; therefore, the key to improving the classification effect is to adopt better feature extraction methods. This view has been confirmed in many prediction models. For instance, the model DEEPre uses five feature extraction methods.
The types of feature extraction can be roughly divided into three categories. The first extracts information from the sequence itself, such as AAC and PseAAC. The second concerns the structural information of proteins, for example the representation of the 3D structure and the torsion angles. The third results from comparing the protein with the information in a corresponding database, as in FunD and PSSM. It should be noted that although structural information describes protein characteristics more thoroughly, features derived from it do not necessarily contribute to classification.
For many years, the development of feature extraction focused on quantifying the primary structure of proteins, as in PSSM, PseAAC, FunD, and CTD. Recently, some scholars have tried specifically designed algorithms, such as the torsion angles, to extract secondary structure information of the protein. However, the key to the function of a protein is the physicochemical behavior arising from its spatial structure; that is, the tertiary structure of a protein is crucial for accurately judging its function. The complexity and unpredictability of higher-order protein structure will be a bottleneck in the development of enzyme function prediction.
In brief, we believe that a satisfactory classification result can be obtained by using these features in combination. Moreover, with the development of bioinformatics, more feature extraction methods will be found [104]-[107], and the classification effect will be further improved.

AUTHOR CONTRIBUTIONS
Yuming Zhao conceived and designed the project. Zhiyu Tao conducted experiments and analyzed the data. Zhiyu Tao and Benzhi Dong wrote the paper. Zhixia Teng and Yuming Zhao revised the manuscript. All authors read and approved the final manuscript.