TM-ZC: A Deep Learning-Based Predictor for the Z-Coordinate of Residues in α-Helical Transmembrane Proteins

Z-coordinate is an important structural feature of <inline-formula> <tex-math notation="LaTeX">$\alpha $ </tex-math></inline-formula>-helical transmembrane proteins (<inline-formula> <tex-math notation="LaTeX">$\alpha $ </tex-math></inline-formula>-TMPs), which is defined as the distance from a residue to the center of the biological membrane. Since the <inline-formula> <tex-math notation="LaTeX">$\alpha $ </tex-math></inline-formula>-TMP structures from both experimental solved and computational predicted approaches still cannot cover the requirements in relevant research fields, z-coordinate prediction provides an opportunity to partly descript <inline-formula> <tex-math notation="LaTeX">$\alpha $ </tex-math></inline-formula>-TMP structures based on their sequences, further contributes to function annotation and drug target discovery. For the purpose of improving the prediction accuracy and providing a convenient tool, we proposed a deep learning-based predictor (TM-ZC) for the z-coordinate of residues in <inline-formula> <tex-math notation="LaTeX">$\alpha $ </tex-math></inline-formula>-TMPs. TM-ZC used the one-hot code and the PSSM as input features for a convolutional neural network (CNN) regression model. The experimental results demonstrated that TM-ZC was a powerful predictor, which is simple and fast, and achieved a considerable performance: the average error was 1.958, the percent of prediction error within 3Å was 77.461%, and the correlation coefficient (CC) was 0.922. We further discussed the usefulness of TM-ZC predicted z-coordinate and found its high consistency with topology structure and the enhancement of the surface accessibility prediction.


I. INTRODUCTION
α-helical transmembrane proteins (α-TMPs) are the major category of transmembrane proteins (TMPs). According to the statistics of the Universal Protein Resource (UniProt) [1], α-TMPs account for more than 98% of the TMPs. α-TMPs play numerous roles in basic physiology and pathophysiology, including signal transduction [2], nutrients or drugs reception [3], immune response [4], and enzyme activation [5]. Malfunction of α-TMP may cause many diseases, such as autism [6], epilepsy [7], and cancer [8]- [11]. Consequently, α-TMPs are the major targets for more than half The associate editor coordinating the review of this manuscript and approving it for publication was Quan Zou . of known drugs, the detailed structure of them would be paramount to the success of drug discovery [12], [13]. Unfortunately, despite their important biological functions, determination of high-resolution structures of α-TMPs persist technical difficulties, only approximately 5% of them are determined.
For this reason, the TMP relevant researches are promoted currently by means of many structural descriptors abstracted from primary sequences. Beyond high-resolution structural information, some low-resolution structural descriptors, such as topology structure, surface accessibility, and z-coordinate, can also provide valuable information about α-TMPs. In recent years, a lot of illuminating methods have been proposed and accessed great achievements. Such as the topology structure prediction methods of α-TMPs [14], [15], especially, S. H. Feng et al. firstly developed a multiscale deep learning protocol (MemBrain 3.0) that includes two submodules: transmembrane helix prediction and orientation prediction [16]. Likewise, several methods have been developed to predict the surface accessibility of α-TMPs and achieved considerable performance [17], [18]. For example, our previous work [19] presented a deep learning-based predictor (TMP-SSurface), which combined the Inception and the CapsuleNet by using one-hot code and PSSM as input features.
Z-coordinate of a residue in α-TMP is defined as the distance from the residue to the center of the membrane [20]. Similar to the topology structure, z-coordinate also reflect the relationship between the residue and the membrane, but by continuous numerical measurement. The z-coordinate is highly correlated with the ligand-binding and the proteinprotein binding regions because these binding regions are always specifically located on transmembrane, water-soluble, or junction regions. The predicted z-coordinate is helpful for the topology prediction [21], structural classification [22], burial status prediction [23], and many other research fields [24], [25]. Accurate predicting the z-coordinate of residues in α-TMPs by computational methods is not only an intermediate step towards structure determination, but also a potential property that may assist the function annotation, drug target discovery, and other associated problems [21], [26], [27].
However, the z-coordinate study has not received as much attention as the study of topology structure and surface accessibility. ZPRED [20], only one z-coordinate predictor published more than a decade, where Artificial Neural Network (ANN) and Hidden Markov Model (HMM) were combined, and sequential features were used as inputs. ZPRED is the pioneering work on the z-coordinate prediction, but its webserver is no longer available. To further support α-TMP research, a z-coordinate predictor is surely needed, which should be more reliable and with high accuracy and performance. In the past 15 years, the number of α-TMPs' structures has increased more than 10 times. More data can be used to improve this research, and deep learning method provides a novel opportunity to construct a more powerful predictor that is simpler and faster, and achieve higher performance at the same time.
In this work, we proposed a deep learning-based predictor (TM-ZC) for the z-coordinate of residues in α-TMPs. TM-ZC used the one-hot code and the PSSM as input features for a convolutional neural network (CNN) regression model. The experimental result proved a considerable performance of TM-ZC. The average error was 1.865Å, the present of prediction error within 3Å was 76.703%, and the correlation coefficient (CC) was 0.917. Besides, We also tested the usefulness of TM-ZC predicted z-coordinate on the problems of surface accessibility prediction and topology structure prediction. We tried to distinguish the transmembrane residues from non-transmembrane residues by limiting the threshold of the z-coordinate predicted by TM-ZC.
The experimental result demonstrated that there exists a strong correlation between the topology structure and TM-ZC predicted z-coordinate. For surface accessibility prediction, we added the TM-ZC predicted z-coordinate as an additional feature of our previous work to predict the surface accessibility of α-TMPs. The experimental result proved that TM-ZC predicted z-coordinate could enhance the prediction performance. A stable webserver is accessible freely in http://icdtools.nenu.edu.cn/TM-ZC.

A. BENCHMARK DATASETS
A dataset used by ZPRED was constructed in 2005 that consisted of 101 non-homologous chains from 46 complexes. We believe that a more comprehensive benchmark dataset is urged. The Protein Data Bank of Transmembrane Proteins (PDBTM) [28] is the most widely used comprehensive data bank for transmembrane proteins. It was created by scanning all PDB entries with the TMDET algorithm [29]. According to the statistics of PDBTM, the number of α-TMPs has increased more than 10 times in the past 15 years. We downloaded 3820 complexes with 13,209 α-TMP sequences from PDBTM (version: 2019-05-10). Removing the sequences that contain residues other than 20 standard amino acids. Removing the short sequences with residues less than 30 because they are always considered as peptides. To reduce the negative effect of homology bias [30], we clustered the rest of the proteins by running CD-HIT with a 0.3 sequence identity cut-off, and the longest sequences in each cluster were collected. After pre-processing, 851 α-TMPs with 223,310 residues were left. Among them, 50 sequences were randomly selected as the independent testing dataset (ZC-test50) for the independent test to verified the robustness of TM-ZC. The remaining 801 sequences were used for training and tuning the prediction model, among them, 50 sequences were randomly selected as the validation dataset (ZC-valid50), and the remaining 751 sequences were built as the training dataset (ZC-train751). The process of selecting the validation dataset were performed for ten times for ten-fold cross validation. The performance reported in this work when training the models was the average performance of sub-models in ten-fold cross validation. All the datasets that used in this work can be found in Supplementary Materials.

B. CALCULATION OF Z-COORDINATE
The original coordinates of residues recorded in the PDB files need to be rotated and moved according to the relative positions of the protein and the membrane. The observed z-coordinate value can be calculated by using Formula 1: where [x i , y i , z i ] represents the original coordinate of the alpha-carbon atom of the ith residue recorded in the PDB files obtained from PDBTM, A is the matrix that rotated the protein 40130 VOLUME 8, 2020 perpendicular to the membrane. b x , b y , b z is a vector that moved the protein to the right place related to the membrane. Both A and b x , b y , b z were obtained from TMDET [29]. The z-coordinate of the membrane center is 0. Then, as the same process of ZPRED, two steps of threshold cutting was performed: Step 1: We took the absolute value of z i that limiting the threshold from (−∞, +∞) to [0, +∞), considering only the distance between the residues and the membrane center without the orientation. The absolute value can be calculated as: Step 2: Limiting z i in range [5,25] so that all the residues with z i between 0 and 5 were defined to be in a central hydrophobic region and the z-coordinate values were set to 5, all the values above 25 were set to 25 because they were considered as non-transmembrane residues. The z-coordinate values within the range [5,25] were considered as the observed z-coordinate labels of TM-ZC. It can be calculated as:

C. ENCODING OF PROTEIN FRAGMENTS 1) EVOLUTIONARY CONSERVATION (PSSM)
In the process of evolution, certain genetic characteristics of proteins have become increasingly prominent among homologous proteins. It has been proved that fragments with high evolutionary conservation always related to the structural or functional needs of the proteins [31], [32]. The position-specific score matrix (PSSM) is an effective descriptor extracted from the result of multiple sequence alignment [33]. The PSSM of a given protein was calculated by running PSI-BLAST [34] against the UniRef50 database (released on October 16, 2019) with e-value threshold 0.001 and 3 iterations. The PSSM of a protein can be defined as a 20 × L matrix: where P i,AA j represent the element's value of PSSM, which represents the occurrence frequency of AA j at the i-th position of the given protein in the result of multiple sequence alignment. L represents the length of the protein. Then, we used the logistic function to normalized each element of PSSM into [0, 1]:

2) ONE-HOT CODE
One-hot coding is a sparse coding method used to represents the type of each residue in a protein sequence. It is the most direct way to describe the protein sequence, reflecting the most primitive arrangement information of 20 standard amino acids. It proved to be a valid feature for deep learning-based protein function predictors [35]- [39]. The one-hot code of a protein can be defined as a 20 × L matrix: where O R i ,AA j represents the element's value of one-hot code. R i is the type of residue on position i. AA j is the type of 20 standard amino acids.
L represents the length of the protein.

3) SLIDING WINDOW
Residues are not isolated in the protein sequence but are arranged in a certain order [40], [41]. The position and function of a target residue are greatly affected by its adjacent residues. Hence, we employed a sliding window scheme that presents the target residue by a protein fragment. Here, we set the window size to 25: target residue with 12 residues from upstream and 12 residues from downstream. The target residue was the center of the protein fragment. At the terminal of the sequence, the features' values of the overflow part of the sliding window were filled with 0. At last, we got a 40×25 matrix as the feature of each residue.

D. MODEL DESIGN
The convolutional neural network (CNN) is a kind of feedforward neural network, in which the neurons can reflect the surrounding information within the coverage of the convolution kernel. The training dataset was used to training the network in each iteration, and the validation dataset was used to validate the performance of this iteration and feedback to the next iteration. It solves the problem of traditional machine learning's dependence on manual features and can learn useful features directly from primitive data. CNN performs great in the field of image and video recognition [42], natural language processing [43], and medical diagnosis [44]. Due to its effectiveness, CNN has been widely used in the field of bioinformatics, such as super-enhancer prediction [45] and drug-disease association prediction [46]. In this work, our goal was to make the prediction model as simple as possible. So we constructed a small network architecture. The design of the prediction model is shown in Fig. 1. All the convolution layers contained 256 kernels with the size of 3 × 3 and the stride of 1, and the activation function was ReLU. All the pooling layers were max pooling with 256 kernels with the size of 2 × 2 and the stride of 2. In this model, one convolution layer was first performed to extract features VOLUME 8, 2020 from the original encoded feature matrix. Then, one max pooling layer was performed. After that, two convolution layers, one max pooling layer, two convolution layers, and one max pooling layer were performed in order. At this time, the size of the feature matrix was 3 × 5, it was too small to perform the following layers. Thus, we performed a zeropadding layer to padding the feature matrix to the size of 3 × 7. Then, two convolution layers and one max pooling layer were performed. At this time, we extract features as 256 matrixes with the size of 1 × 3. We flatten them in a line, and a full-connected neural network was performed. The input layer contains 768 neurons, the hidden layer contains 256 neurons, and the output layer contains just one neuron. The opportunity of a target residue that belongs to the output neuron was considered as the prediction result.

E. PERFORMANCE EVALUATION
In order to measure the prediction performance of TM-ZC, three indicators were used to reflect it: the mean absolute error (MAE), the Pearson correlation coefficient (CC), and the percent of results that error less than 3Å (P 3Å ). MAE reflects the average deviation between the predicted and observed z-coordinate of all residues. It ranged in [0, 1], the smaller the MAE value, the better the performance. CC reflects the linear correlation between predicted and observed z-coordinate. It ranged in [−1, 1], the more the CC close to 1, the better the performance. P 3Å reflects the ratio of the ideal prediction results, the threshold inherited from ZPRED. It ranged in [0%, 100%], the higher the ratio, the better the performance.
where L represents the number of residues, x i and y i represent the observed and predicted z-coordinate of the ith residue, and x andȳ represent the corresponding mean value, N represents the number of residues that prediction error less than 3Å.

A. FEATURE ANALYSIS
In order to investigate the effectiveness of different kinds of features and their contribution to the prediction model, we performed the ablation study on features. Three models were built by using one-hot code, PSSM, and both of them. As shown in TABLE 1, PSSM outperforms one-hot code. This phenomenon reflects the strong correlation between evolutionary conservation and protein structure. Although the model using one-hot code alone performed poorly, it complemented the PSSM feature that the model achieved the best performance while using two features together.

B. EFFECT OF WINDOW SIZE
A sliding window scheme was used in this work, and the value of the window size greatly affected the prediction performance of TM-ZC. We test the possible value of window size from 15 to 31 with the step size of 2. The prediction performance on the validation dataset while training TM-ZC

C. EFFECT OF CUTTING THRESHOLD
As described in the ''Section II-BCalculation of Z-coordinate'', the original coordinates of residues recorded in the PDB files were rotated and moved according to the relative positions of the protein and the membrane. Then, two steps of threshold cutting were performed. The first step took the absolute value that cutting the threshold from (−∞, +∞) to [0, +∞). The second step limited the value in the range [5,25]. Three training labels with different cutting thresholds were obtained, and three models were built by using different labels. The performance of these models on the validation dataset (ZC-valid50) is illustrated in TABLE 3. It is obvious that the performance of the model using the original labels with the threshold of (−∞, +∞) was poor, and the other two models significantly outperformed it. There proposed a possible explanation for this phenomenon: Residues that are symmetrical about the membrane center always have similar features while their labels are opposite to each other, which makes it difficult for the prediction model to find the relationship between the features and the corresponding labels. It can also be seen that the models using the labels with the threshold of [5,25] achieved the best performance.

D. PERFORMANCE OF TM-ZC
In order to investigate the performance of TM-ZC and verified its stability, we performed ten-fold cross validation.   Fig. 2 shows the performance of the TM-ZC in terms of (a) MAE value and (b) CC value. As shown in (a), the mean value of MAE of ten models is 2.215, the deviation between the MAE of each sub-model and the mean value is shown as the label value. As shown in (b), the mean value of CC of ten models is 0.915, the deviation between the CC of each sub-model and the mean value is shown as the label value. Fig. 2 exhibits that TM-ZC performed stably among cross-validation.

E. COMPARE WITH ZPRED
ZPRED is the only predictor in existence. We compared the TM-ZC with ZPRED to investigated its effectiveness. Fig. 3 illustrates the mean absolute error (MAE) of ZPRED and TM-ZC in different z-coordinate regions. It could be seen that TM-ZC outperforms ZPRED in most of the z-coordinate regions, especially in regions with low z-coordinate value. A low z-coordinate indicates that the residue locates in the hydrophobic transmembrane region. It proved that TM-ZC performed great for residues in the transmembrane region. In the region [19.5, 24.5), the advantages of TM-ZC became less obvious and even worse than ZPRED. It shows that TM-ZC's prediction performance for residues in the junction area on the membrane surface needs to be further improved. The overall prediction performance comparison between ZPRED and TM-ZC is illustrated in TABLE 4. It is obvious that TM-ZC outperformed ZPRED. The MAE reduced by approximately 43%, the percent of results that error less than 3Å (P 3Å ) increased more than 28% and the increment of the Pearson correlation coefficient (CC) up to 0.1.

F. CASE STUDIES
We performed case studies to demonstrate the effectiveness of TM-ZC further. 6EU6_A and 5L25_A from ZC-test50 were chosen as examples. 6EU6_A is an eleven-transmembrane protein from Escherichia Coli. It is the target of ATP, Dodecyl-Alpha-D-Maltoside, and other ligands. 5L25_A is a ten-transmembrane protein from Saccharomyces Cerevisiae. It plays an important role in the process of anion exchange and borate transport. The prediction results of both proteins are illustrated in Fig. 4.
In Fig. 4: (a) visualized the TM-ZC predicted z-coordinate of 6EU6_A, the darker the color, the larger the precited z-coordinate value. (b) is the curve of the z-coordinate of 6EU6_A, including the observed value, the ZPRED predicted value, and the TM-ZC predicted value. (c) and (d) illustrated the same information of 5L25_A as (a) and (b). It could be seen from (a) and (c) that the z-coordinate predicted by TM-ZC is basically in line with the actual situation. It is further confirmed in (b) and (d) that the curve of the TM-ZC predicted value fits the curve of the observed value to a high degree, and it is better than ZPRED.

G. Z-COORDINATE CORRELATED WITH TOPOLOGY STRUCTURE
The z-coordinate directly relates to the topology structure of residues in TMPs. We tried to distinguish the transmembrane residues from non-transmembrane residues by limiting the threshold of the z-coordinate predicted by TM-ZC. This experiment was performed on the ZC-test50 dataset and achieved the highest accuracy (79.268%) at the threshold of 14.5Å. The residues with z-coordinate within 14.5Å were considered as transmembrane residues. That means the mean thickness of the membrane was 14.5 × 2 = 29Å, and the biological experiment proved that the mean thickness of the membrane is about 30Å [47]. Therefore, the threshold of TM-ZC predicted z-coordinate is in line with facts, and there exists a strong correlation between topology structure and z-coordinate.

H. TM-ZC ENHANCE THE PREDICTION OF SURFACE ACCESSIBILITY
In addition to having a strong correlation with the topology structure, the TM-ZC predicted z-coordinate could also enhance the prediction of the surface accessibility of α-TMPs residues. Our team once proposed a predictor (TMP-SSurface) for the relative accessible surface area (rASA) prediction of residues in α-TMPs [19]. TMP-SSurface used one-hot code, terminal flag, and PSSM as the input features of a deep learning-based regression method. We added TM-ZC predicted z-coordinate as an additional feature to verified the usefulness of TM-ZC and we were glad to find that the value of the Pearson correlation coefficient (CC) increased from 0.581 to 0.604. This experiment proved that there is also a correlation between z-coordinates and rASA.

IV. CONCLUSION
The z-coordinate of a residue in α-TMPs is defined as the distance between the residue and the membrane center. It is an important structure descriptor that highly correlated with the function regions of α-TMPs. Up to now, ZPRED is the only one predictor for this problem and needed to be further improved. In this work, we proposed a deep learning-based predictor (TM-ZC) to predict the z-coordinate of residues in α-TMPs. TM-ZC is a simple CNN-based predictor that used one-hot code and PSSM as input features. TM-ZC achieved great performance that the MAE was 1.958, the CC was 0.922, and the percent of prediction error within 3Å was 77.461%. Experiments showed the contribution of two kinds of feature, and find that PSSM features were more powerful. We also verified the correlations between TM-ZC predicted z-coordinate and tested the usefulness of TM-ZC predicted z-coordinate on the problems of surface accessibility prediction. Experiments proved that TM-ZC could provide effective support for related problems. We are confident that TM-ZC can be further used in more researches on transmembrane proteins.
YINGLI GONG was born in Shandong, China, in 1998. She is currently pursuing the B.S. degree in computer science and technology with Northeast Normal University, Changchun, China. Her current research interests include data mining, machine learning, and bioinformatics.
ZHE LIU was born in Zhejiang, China, in 1997. She is currently pursuing the bachelor's degree with the School of Information Science and Technology, Northeast Normal University, China. She joined the Institution of Computational Biology, Northeast Normal University, in 2019, where she participated in some research on transmembrane protein structure prediction using deep learning methods.
YUANZHAO GUO was born in Jilin, China, in 1998. She is currently pursuing the B.S. degree with the School of Information Science and Technology, Northeast Normal University, China. Her research interests include bioinformatics and data mining.
ZHIQIANG MA received the Ph.D. degree from the School of Computer Science, Jilin University, in 2009. He is currently a Professor with the School of Information Science and Technology, Northeast Normal University. He is also the Vice President of the Research Association of Computer Education, Normal Universities of China, and the Executive Director of the Jilin Computer Federation. His interests include bioinformatics, software engineering, molecular biology, and data mining HAN WANG is currently the Director of the Institution of Computational Biology, Northeast Normal University, China. Major in transmembrane protein structure and function researching using artificial intelligence methods, which extend to the researches, including big biological data, proteinprotein interaction, new drug target discovering, and drug affection. He has published many peerreviewed articles in these fields, funded by the National Natural Science Foundation of China, Jilin Scientific, and Technological Development Program of China. VOLUME 8, 2020