GANcon: Protein Contact Map Prediction With Deep Generative Adversarial Network

Accurate protein contact map prediction is essential for de novo protein structure prediction. Over the past few years, deep learning has brought a significant breakthrough in protein contact map prediction and optimized deep learning architectures are highly desired for performance improvement. As an emerging deep learning architecture, the generative adversarial network (GAN) has shown the powerful capability of learning intrinsic patterns, which inspires us to comprehensively exploit GAN for predicting accurate protein contact maps. In this study, we present GANcon, a novel GAN-based deep learning architecture for protein contact map prediction, which to the best of our knowledge is the first GAN-based approach in this field. Instead of using a single neural network, GANcon is composed of two competitive networks that are evolving through adversarial learning. The generator network employs a dedicated encoder-decoder architecture that can efficiently capture the underlying contact information from versatile protein features to generate contact maps, while the discriminator network learns the differences between generated contact maps and real ones and promotes the generator network to produce more accurate contact maps. Moreover, to deal with the imbalance problem and take into account the symmetry of contact maps, we also propose a novel symmetrical focal loss, which can further enhance the effectiveness of adversarial learning for better performance. The experimental results on several datasets demonstrate that GANcon outperforms many state-of-the-art methods, indicating the effectiveness of our method for predicting protein contact maps. GANcon is freely available at https://github.com/melissaya/GANcon.


I. INTRODUCTION
Proteins are crucially important macromolecules in an organism and play a fundamental role in almost all biological processes. To carry out their essential cellular functions, proteins fold into specific three-dimensional structures, which are driven and stabilized by the interactions between protein residues, i.e., protein residue contacts. For a protein sequence, all contacts of residue pairs can be encoded into a binary matrix named 'contact map', which has been regarded as a critical contributor for accurate de novo protein structure prediction [1], [2]. In recent Critical Assessment of protein Structure Prediction (CASP) experiments, many excellent de novo protein structure prediction methods have benefited much from the incorporation of predicted contact maps [3], [4]. Due to the importance of contact maps in protein structure prediction, research on predicting protein contact maps has boomed in the past decade. For example, the evolutionary coupling analysis (ECA) methods, such as CCMpred [5] and FreeContact [6], predict contacts by capturing co-evolved residues from protein multiple sequence alignments (MSAs). These methods are effective for predicting contacts in proteins with a large number of high-quality MSAs, while their predictive performance is limited if the proteins have few or low-quality MSAs [7], [8]. In contrast, machine learning methods can learn complex relationships from all kinds of information, including the co-evolutionary information estimated by ECA methods, and therefore have been more successful in contact map prediction [9].
The associate editor coordinating the review of this manuscript and approving it for publication was Hualong Yu. VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Over the past few years, as a powerful machine learning technique, deep learning has brought a significant breakthrough in protein contact map prediction [8], [10], [11]. A typical deep learning method usually adopts a multilayer convolutional neural network (CNN) architecture to learn inherent patterns in contact maps automatically. For example, DNCON2 is composed of six CNN blocks [12], RaptorX-Contact uses a deep residual convolutional network (ResNet) [11], and SPOT-Contact combines ResNet with two-dimensional residual bidirectional recurrent long short-term memory networks [8]. These carefully designed deep learning architectures have shown remarkable predictive power in recent CASP experiments, which inspires us to explore more optimized deep learning architectures for performance improvement.
Most recently, as an emerging deep learning architecture, the generative adversarial network (GAN) [13] has received considerable attention due to its powerful capability of learning intrinsic patterns in diverse fields, such as image classification [14] and gene expression inference [15]. Instead of using a single deep neural network, GAN is composed of two competitive networks, namely a generator network and a discriminator network, which evolve through an adversarial learning strategy: the generator network produces fake samples and tries to fool the discriminator network into believing the generated samples are real, while the discriminator network tries to distinguish the generated samples from the real ones and guides the generator network to produce more realistic samples. Although it is of great interest to comprehensively exploit GAN for predicting protein contact maps, there are still several issues to address. First, although contact map prediction can be interpreted as a pixel-level image classification problem in which each pixel represents one residue pair [16], the input features adopted in contact map prediction include versatile protein features such as the GaussDCA scores [17], Atchley factors [18] and the log number of sequences in the alignment, which are usually more complex than the input of image classification [11]. Therefore, to produce accurate contact maps, the generator network of GAN is required to efficiently capture the underlying contact information from these complex features. Second, the contact map is a binary matrix and the ratio of contact to non-contact residue pairs is extremely low (<2%) [19], which leads to a severe imbalance problem [7], [9], especially for GAN models optimized by the commonly used binary cross-entropy (BCE) loss [20]. Third, symmetry is an important and unique property of contact maps [21], which, however, is absent in almost all other prediction tasks and is therefore not considered in existing GAN models.
In this work, we present a novel GAN-based deep learning architecture called GANcon for protein contact map prediction, which to the best of our knowledge is the first GAN-based approach in this field. The generator network of GANcon employs a dedicated encoder-decoder architecture to efficiently capture the underlying contact information from versatile protein features. Meanwhile, through adversarial learning, the discriminator network of GANcon learns the differences between generated contact maps and real ones and promotes the generator network to produce more accurate contact maps. Moreover, to cope with the imbalance problem and take into account the symmetry of contact maps, we also propose a novel symmetrical focal (SF) loss in this study, which can further enhance the effectiveness of adversarial learning for better prediction results. We assess the prediction performance of GANcon on an independent test dataset of 360 proteins, and CASP12 and CASP13 datasets. The experimental results demonstrate that GANcon shows very promising performance for all length cutoffs and sequence separations.

A. THE ARCHITECTURE OF GANcon
The overall architecture of GANcon including a generator network and a discriminator network is depicted in Figure 1. The generator network of GANcon takes the given protein features as input to produce contact maps, while the discriminator network of GANcon works as a classifier to discriminate generated contact maps from real contact maps. With both the adversarial loss and the SF loss, adversarial learning promotes the generator network to produce accurate contact maps that approximate the distribution of real contact maps. After training, the generator network will then be adopted for contact map prediction.
To efficiently capture the underlying contact information from versatile protein features, the generator network of GANcon employs a dedicated encoder-decoder architecture. As shown in Figure 1, the encoder path consists of a series of residual blocks (ResBlocks) [22], 3 × 3 convolutional layers with rectified linear unit (ReLU), and 2 × 2 max-pooling layers with a stride of 2. ResBlocks use short skip connections to pass the block input to its output, and therefore learn a residual representation of the protein features that is helpful for capturing complex relationships between residue pairs. Max-pooling layers down-sample the output of the previous layer, and two dropout layers are added to the last two steps of the encoder path to prevent the network from overfitting. The decoder path consists of a series of 2 × 2 up-convolutional layers, 3 × 3 convolutional layers with ReLU and ResBlocks. Moreover, long skip connections are added between the encoder path and the decoder path to provide details about contact information at multiple levels. Finally, a 1 × 1 convolutional layer with a sigmoid activation function is used to produce the pixel-level contact map prediction.
FIGURE 1. The overall architecture of GANcon. GANcon is composed of a generator network and a discriminator network. The generator network produces contact maps that are highly similar to real contact maps while the discriminator network tries to distinguish generated contact maps from real contact maps. With both the adversarial loss and the SF loss, adversarial learning promotes the generator network to produce accurate contact maps that the discriminator network cannot distinguish from the real ones.
The discriminator network of GANcon, which plays an adversarial role to promote the generator network to produce accurate contact maps, comprises three 3 × 3 convolutional layers with a leaky rectified linear unit (LeakyReLU) and a 1 × 1 convolutional layer with a sigmoid activation function. Similar to the previous study [23], the discriminator network receives the concatenated pair of the generated or real contact map and the corresponding input protein features. The outputs are the pixel-level probabilities of real or generated contact residue pairs in a contact map.

B. ADVERSARIAL LEARNING
We denote the input protein features as $X$ of size $L \times L \times N$, where $L$ is the length of the protein sequence and $N$ is the number of protein features. The corresponding real contact map is denoted as $M$ of size $L \times L \times 1$. The generator network of GANcon is denoted as $G(\cdot)$ and outputs a generated contact map $\hat{M} = G(X)$ of size $L \times L \times 1$. The discriminator network of GANcon is denoted as $D(\cdot)$ and outputs a probability matrix $P = D(X, M)$ or $\hat{P} = D(X, \hat{M})$ of size $L \times L \times 1$, which includes pixel-level probabilities of contact pairs coming from a real contact map $M$ or a generated contact map $\hat{M}$.
During the adversarial learning process, the adversarial loss used for the discriminator network is based on BCE loss:

$$L_{D} = -\sum_{i,j} \left[ z \log P_{ij} + (1 - z) \log\left(1 - \hat{P}_{ij}\right) \right] \tag{1}$$

where $z = 0$ if the input includes generated contact maps and $z = 1$ if the input includes real contact maps. $P_{ij}$ and $\hat{P}_{ij}$ are the values of the $i$-th row and $j$-th column in $P$ and $\hat{P}$, respectively. The first term of (1) is used to classify $M$ as real at pixel level, while the second term is used to classify $\hat{M}$ as fake at pixel level.
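As an illustration of the pixel-level BCE loss described above, the following is a minimal sketch in plain Python rather than the authors' Keras code. The function name `discriminator_bce_loss`, the clamping constant `eps`, and the choice to sum (rather than average) over residue pairs are assumptions made for this example.

```python
import math


def discriminator_bce_loss(prob_map, is_real, eps=1e-7):
    """Pixel-level BCE over an L x L probability map output by the discriminator.

    prob_map: list of rows holding probabilities in (0, 1).
    is_real:  True if the discriminator was fed a real contact map (z = 1),
              False if it was fed a generated one (z = 0).
    Summing over pixels (instead of averaging) is an assumption of this sketch.
    """
    z = 1.0 if is_real else 0.0
    total = 0.0
    for row in prob_map:
        for p in row:
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= z * math.log(p) + (1.0 - z) * math.log(1.0 - p)
    return total
```

With this form, a discriminator that outputs high probabilities for a real input incurs a small loss, whereas the same outputs for a generated input incur a large one.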
To train the generator network, GANcon uses a loss function that is a weighted sum of an adversarial loss based on $\hat{P}$ and an SF loss between $M$ and $\hat{M}$, where the weight $\lambda$ is set to 1.0 to maintain the balance of adversarial learning. The adversarial loss used for the generator network aims to fool the discriminator network by maximizing the probability of $\hat{M}$ being considered as real. Although BCE loss between $M$ and $\hat{M}$ is commonly used in existing GAN models for almost all other prediction tasks, it is not suitable for protein contact map prediction, as it fails to deal with the imbalance problem caused by the low ratio of contact to non-contact residue pairs and ignores the symmetry of contact maps. To solve these problems, we propose a novel SF loss that amends BCE loss by combining the focal loss introduced by Lin et al. [20] with a symmetrical loss, where the weight $\beta$ is set to 1.0 to balance the role of the two loss terms. The focal loss $L_{G}^{F}$ is effective for the imbalance problem; in it, $\alpha \in [0, 1]$ is a weighting factor that adjusts the importance of contact and non-contact residue pairs, and $\gamma$ is a parameter that puts the focus on hard, misclassified residue pairs and reduces the loss contribution of easy-to-classify residue pairs. In this study, $\alpha$ is set to 0.25 and $\gamma$ is set to 2.0. The symmetrical loss $L_{G}^{S}$ is introduced to keep the generated contact maps as symmetric as possible.
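To make the SF loss more concrete, the sketch below combines an alpha-balanced focal term, in the form given by Lin et al., with a symmetry penalty. The exact definitions used in GANcon are not reproduced here: the symmetry term as a sum of absolute differences between each entry and its transposed counterpart, the placement of `beta` on the symmetry term, and all function names are assumptions for illustration only.

```python
import math


def focal_loss(true_map, pred_map, alpha=0.25, gamma=2.0, eps=1e-7):
    """Alpha-balanced focal loss summed over all residue pairs."""
    total = 0.0
    for t_row, p_row in zip(true_map, pred_map):
        for t, p in zip(t_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            if t == 1:
                # contact pair: down-weight easy cases where p is close to 1
                total -= alpha * (1.0 - p) ** gamma * math.log(p)
            else:
                # non-contact pair: down-weight easy cases where p is close to 0
                total -= (1.0 - alpha) * p ** gamma * math.log(1.0 - p)
    return total


def symmetry_loss(pred_map):
    """Assumed form: sum of |p_ij - p_ji|; zero only for a symmetric map."""
    n = len(pred_map)
    return sum(abs(pred_map[i][j] - pred_map[j][i])
               for i in range(n) for j in range(n))


def sf_loss(true_map, pred_map, beta=1.0, alpha=0.25, gamma=2.0):
    """Focal term plus beta times the (assumed) symmetry term."""
    return (focal_loss(true_map, pred_map, alpha, gamma)
            + beta * symmetry_loss(pred_map))
```

Under these assumptions, a symmetric prediction contributes nothing to the symmetry term, so the loss reduces to the focal term alone.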

C. IMPLEMENTATION DETAILS
GANcon is implemented using the Keras library (https://keras.io) along with TensorFlow (https://www.tensorflow.org). We use the Adam optimization method with an initial learning rate of 1E-4 for the generator network and 1E-5 for the discriminator network. In each epoch, a mini-batch size of 1 is used for both networks due to GPU memory limitations, and we train the discriminator network 3 times for each training step of the generator network. GANcon takes approximately 15 hours (20-30 epochs) to converge on an Nvidia 1080 Ti GPU.
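The alternating update schedule described above can be sketched as follows. This is a framework-free outline, not the released training script: the callables `update_discriminator` and `update_generator` are hypothetical placeholders for one optimizer step each, and reusing the same protein for all three discriminator steps is an assumption.

```python
# Initial learning rates reported in the text (Adam optimizer).
GENERATOR_LR = 1e-4
DISCRIMINATOR_LR = 1e-5
D_STEPS_PER_G_STEP = 3


def train_epoch(samples, update_discriminator, update_generator,
                d_steps=D_STEPS_PER_G_STEP):
    """Run one epoch with a mini-batch size of 1.

    samples: iterable of (features, real_map) pairs, one protein each.
    update_discriminator / update_generator: hypothetical callables that
    each perform a single optimizer step on the given protein.
    """
    for features, real_map in samples:
        for _ in range(d_steps):
            update_discriminator(features, real_map)
        update_generator(features, real_map)
```

Passing the update steps in as callables keeps the schedule independent of the deep learning framework used for the two networks.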

D. PROTEIN FEATURES
As shown in Supplementary Table S1, the protein features used in GANcon include various two-dimensional, one-dimensional and scalar features, which are consistent with those used by many other methods [24], [25]. To derive these features from MSAs, following a procedure similar to that of previous methods [12], [25], we first run HHblits [26] with an E-value threshold of 1E-3 to search the Uniclust30 database [27] and generate alignments. If the alignment found by HHblits has fewer than 2000 homologous sequences, we then run JackHMMER [28] with E-value thresholds of 1E-20, 1E-10, 1E-4 and 1 to search the UniRef90 database [29].
After that, we use these alignments to generate other protein features, e.g., GaussDCA scores [17]. Finally, both one-dimensional and scalar protein features are duplicated to form two-dimensional matrices, which are used together with the two-dimensional protein features as the inputs of GANcon [12].
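One common way to duplicate one-dimensional and scalar features into two-dimensional matrices is shown below as a minimal sketch. The text does not specify the exact tiling scheme, so producing one row-wise and one column-wise channel per one-dimensional feature is an assumption, as are the helper names `tile_1d` and `tile_scalar`.

```python
def tile_1d(feature):
    """Expand a length-L per-residue feature into two L x L channels.

    row_channel[i][j] holds the value of residue i and col_channel[i][j]
    holds the value of residue j, so each pair (i, j) sees both residues.
    """
    n = len(feature)
    row_channel = [[feature[i] for _ in range(n)] for i in range(n)]
    col_channel = [[feature[j] for j in range(n)] for _ in range(n)]
    return row_channel, col_channel


def tile_scalar(value, n):
    """Expand a per-protein scalar into a constant n x n channel."""
    return [[value] * n for _ in range(n)]
```

For example, `tile_scalar(0.5, 2)` gives `[[0.5, 0.5], [0.5, 0.5]]`, which can then be stacked with the native two-dimensional features along the channel axis.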

E. DATASETS
The dataset used in this study consists of SCOPe 2.07 subsets filtered for sequences with less than 30% sequence identity (based on PDB SEQRES records) and sequence lengths between 50 and 500 [30]. Meanwhile, the dataset is divided into three non-overlapping sets for training, validation and independent test, which is a commonly-used performance evaluation strategy in deep learning methods [31], [32].
In this way, 7192 proteins from SCOPe 2.06 are allocated to the training and validation datasets (90% and 10%, respectively) [31], while 360 proteins newly released in the SCOPe 2.07 are allocated to the independent test dataset. Moreover, we also carry out additional testing on the publicly available targets in recent CASP experiments including 22 CASP12 free modeling (FM) targets and 15 CASP13 FM targets, for an objective comparison with state-of-the-art methods.

F. EVALUATION CRITERIA
According to the standard CASP definition [33], protein residues are defined as in contact when the Euclidean distance between two Cβ atoms (Cα for glycine) falls within 8 Å.
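The contact definition above can be illustrated with a short sketch that converts atom coordinates into a binary contact map. The input format (one Cβ coordinate per residue, Cα for glycine, in ångströms), the function name `contact_map`, and the use of a strict `<` comparison for "within 8 Å" are assumptions made for this example.

```python
import math


def contact_map(coords, threshold=8.0):
    """Build a binary L x L contact map from per-residue 3D coordinates.

    coords: list of (x, y, z) tuples, one per residue, in angstroms.
    A pair is marked 1 when its Euclidean distance is below the threshold.
    The diagonal is trivially 1 because each residue is at distance 0
    from itself; such pairs are excluded later by sequence separation.
    """
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) < threshold else 0
             for j in range(n)]
            for i in range(n)]
```

Because Euclidean distance is symmetric, the resulting map is symmetric by construction.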
All contacts are divided into three groups depending on the sequence separation, including long-range (sequence separation ≥ 24), medium-range (12 ≤ sequence separation < 24) and short-range (6 ≤ sequence separation < 12). Following the CASP routine, we take the top L/k (k = 5, 2, 1) predicted contacts, where L is the protein sequence length, to calculate the precision, recall, and F1 score. These three metrics are defined as:

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

and

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where True Positive is the number of correctly predicted contacts, False Positive is the number of non-contacts falsely predicted as contacts and False Negative is the number of contacts falsely predicted as non-contacts.
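The top L/k evaluation can be sketched as follows. This is an illustrative implementation rather than the official CASP script: scoring only the upper triangle, using integer division for L/k, keeping at least one prediction, and the names `range_of` and `top_precision` are all assumptions.

```python
def range_of(i, j):
    """Classify a residue pair by sequence separation |i - j|."""
    sep = abs(i - j)
    if sep >= 24:
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= 6:
        return "short"
    return None  # pairs closer than 6 residues are not evaluated


def top_precision(pred_map, true_map, k, contact_range):
    """Precision of the top L/k predicted contacts within one range.

    pred_map: L x L matrix of predicted contact probabilities.
    true_map: L x L binary matrix of real contacts.
    """
    n = len(pred_map)
    scored = [(pred_map[i][j], i, j)
              for i in range(n) for j in range(i + 1, n)
              if range_of(i, j) == contact_range]
    scored.sort(key=lambda item: item[0], reverse=True)
    top = scored[:max(1, n // k)]
    if not top:
        return 0.0
    hits = sum(true_map[i][j] for _, i, j in top)
    return hits / len(top)
```

Since every selected pair is treated as a predicted contact, precision here is simply the fraction of selected pairs that are real contacts.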
To further analyze all the predictions from GANcon, we provide Precision-Recall (PR) curves and Receiver Operating Characteristic (ROC) curves. The corresponding area under the PR curve (AUPRC) and area under the ROC curve (AUC) scores for long-, medium- and short-range contacts are also provided. Consistent with previous studies [16], [34], we use P-values from the Student's t-test to compare these metrics of other methods with those of GANcon, as a measure of the statistical significance of the difference between two methods' results.

A. THE EFFECTIVENESS OF ADVERSARIAL LEARNING
To measure the impact of adversarial learning in protein contact map prediction, in Table 1 we first compare the prediction precisions of models with and without adversarial learning on the validation dataset for both top-ranking prediction length cutoffs (L/5, L/2 and L) and sequence separations (long-, medium- and short-range). The model without adversarial learning is treated as the baseline; it uses only the generator network of GANcon and is trained with the commonly used BCE loss. The model with adversarial learning (i.e., GAN) uses both the generator network and the discriminator network of GANcon and is trained with both adversarial loss and BCE loss. As shown in Table 1, the prediction performance is greatly improved for all levels of contact precisions with the help of adversarial learning. For example, for top L/5 long-, medium- and short-range contacts, the model with adversarial learning has a precision of 83.03%, 71.69% and 68.60%, respectively, which is 5.27, 4.68 and 4.55 percentage points higher than that without adversarial learning (77.76%, 67.01% and 64.05%), respectively. The corresponding P-value in the Student's t-test is 2.56E-61, 1.02E-117 and 1.52E-30 (Supplementary Table S2), respectively, indicating that the improvement is statistically significant. In addition to precisions, we also show the F1 scores, AUPRC scores and AUC scores of different models in Supplementary Tables S3-S5. From these results, we find that adversarial learning leads to significant improvements for all length cutoffs and sequence separations. For example, the model without adversarial learning obtains an AUPRC score of 49.66%, 57.94% and 56.57% for long-, medium- and short-range contacts, respectively, which is 6.18, 5.65 and 5.62 percentage points lower than the model with adversarial learning (55.84%, 63.59% and 62.19%), respectively. Taken together, these results indicate the effectiveness of adversarial learning in protein contact map prediction.
Furthermore, to explore the optimal loss function yielding better performance during adversarial learning, we train the GANcon model using adversarial learning with the proposed SF loss. As shown in Table 1, the proposed SF loss consistently brings additional performance improvements for all length cutoffs and sequence separations. For example, the precisions for top L/5, L/2 and L long-range contacts are 87.07%, 77.24% and 62.06% with SF loss, respectively, compared to 83.03%, 72.62% and 58.03% with BCE loss, respectively. Also, the corresponding P-values shown in Supplementary Table S2 suggest that the performance improvements obtained by using SF loss are all statistically significant. Similar results in F1 scores, AUPRC scores and AUC scores can also be observed in Supplementary Tables S3-S5, which further demonstrates that the proposed SF loss is indeed effective for protein contact map prediction. Overall, by combining SF loss with adversarial learning, GANcon boosts the precision over the baseline by 9.31, 7.88 and 8.62 percentage points for top L/5 long-, medium- and short-range contacts.

B. COMPARISONS OF GANcon WITH EXISTING METHODS
We compare GANcon on the independent test dataset with four state-of-the-art deep learning methods, including DNCON2 [12], DeepContact [7], PconsC4 [24] and DeepCov [34], and two well-known ECA methods, CCMpred [5] and FreeContact [6]. All of these compared methods are downloaded and run on our local computers with default settings. Among these methods, DeepCov, PconsC4, CCMpred and FreeContact are fed with the same MSAs used in GANcon since they do not have a built-in pipeline to generate MSAs, while the other methods are fed with protein sequences directly, which is consistent with previous studies [12], [16].
The comparison results in Table 2 for the precisions clearly show that the deep learning methods significantly outperform the ECA methods, which is also corroborated by previous studies [16], [35]. For example, DeepContact achieves a precision of 86.66%, 75.52% and 60.20% for top L/5, L/2 and L long-range contacts, respectively (Table 2), which is 24.85%, 25.88% and 22.92% higher than FreeContact, respectively. At the same time, GANcon performs consistently better than other deep learning methods and the corresponding precision reaches 89.93%, 80.84% and 65.87%, respectively. The P-values shown in Supplementary Table S6 suggest that the improvement is significant. We also train GANcon with other training-validation dataset ratios (80%-20% and 70%-30%), and the precision results in Supplementary Table S7 show that there is no obvious difference in the performance. Moreover, as shown in Supplementary Table S8, the F1 score of GANcon is 45.91% for top L/5 long-range contacts, while the next-best deep learning method has an F1 score of 43.01%. All these results indicate that with the novel deep learning architecture, GANcon has a very competitive performance for contact map prediction. In addition, we also provide the PR and ROC curves with the corresponding AUPRC and AUC scores of different methods for long-range contacts in Figure 2 and Figure 3. As shown in Figure 2, the PR curve for long-range contacts confirms that GANcon has a better precision at a given level of recall than other methods, and the corresponding AUPRC score is 63.57%, which is at least 7% better than other methods investigated in this study. Meanwhile, the PR curves in Supplementary Figure S1 demonstrate the advantage of GANcon for medium- and short-range contacts. Also, the ROC curves with the corresponding AUC scores in Figure 3 and Supplementary Figure S2 show similar results when long-, medium- and short-range contacts are evaluated.
To explore the effect of the number of homologous sequences on the performance of computational methods, we present the precisions of the top L/5 long-range contacts as a function of the maximum log Neff scores in Figure 4, where Neff is defined as the number of effective sequences in MSAs and a higher score implies more homologous sequences in the reference database. As shown in Figure 4, in general, all [...]
To compare with well-performing methods in recent CASP experiments, we evaluate the prediction results of RaptorX-Contact [11] obtained from the CASP website (http://predictioncenter.org/) and the prediction results of SPOT-Contact [8] obtained from its webserver (http://sparkslab.org/jack/server/SPOT-Contact/). The comparison results with respect to precisions can be found in Table 3 and Supplementary Tables S9-S10. The baseline model of GANcon has in general comparable performance to most of the investigated methods except the state-of-the-art RaptorX-Contact and SPOT-Contact. Meanwhile, we also observe that using adversarial learning and SF loss brings significant performance improvements for all length cutoffs and sequence separations. For example, there are more than 23%, 15% and 8% improvements in precision for top L/5, L/2 and L long-range contacts on 15 CASP13 FM targets (Table 3). In addition, similar results in F1, AUPRC and AUC scores of GANcon and other methods are shown in Supplementary Tables S11-S14, which further demonstrate that GANcon can be used as a complementary method for protein contact map prediction.

IV. CONCLUSION
Accurate prediction of the protein contact map is of great significance in de novo protein structure prediction. As many carefully designed deep learning architectures have shown remarkable prediction power in many areas of bioinformatics [36]-[38], especially in contact map prediction [8], [11], further exploration of more optimized deep learning architectures for performance improvement is highly desired. In this study, we propose a novel GAN-based architecture, GANcon, for contact map prediction. Unlike previous deep learning methods, which train a single network for protein contact map prediction, GANcon incorporates a discriminator network to promote the generator network to achieve accurate contact map prediction. During the adversarial learning process, the generator network of GANcon captures the underlying contact information from versatile protein features by employing a dedicated encoder-decoder architecture, while the discriminator network learns the differences between generated contact maps and real ones and automatically transfers them back to the generator network. Meanwhile, to deal with the imbalance problem and consider the symmetry of contact maps, a novel SF loss is proposed in this study that, together with the adversarial loss, can further enhance the adversarial learning of GANcon for better prediction results. Notably, combining adversarial learning with SF loss brings consistent improvements in prediction performance across all the datasets assessed in this study, indicating that adversarial learning and SF loss might be adopted as a general learning strategy for the task of protein contact map prediction.
Although GANcon shows promising performance in protein contact map prediction, there is still room for further improvement. The adversarial loss of GANcon is based on pixel-level probabilities in the output matrix of the discriminator, while an adversarial loss based on an output at the level of the whole contact map is also useful for training GAN models and can be adopted in our future work. Besides, a well-known problem is that the training of GAN sometimes suffers from instability [39], which also occurs during the training process of GANcon in this study. Therefore, advanced GAN training methods that can improve training stability, such as WGAN [39], will be explored in our future study. Also, it would be interesting to integrate GAN with other popular deep learning modules, such as long short-term memory (LSTM), which has been confirmed to be effective in contact map prediction [8], to boost the predictive power of GAN-based architectures. Moreover, in addition to the protein features adopted in this study, other important features, e.g., predictions of third-party predictors such as CCMpred and FreeContact, may also be used by GANcon to enhance prediction performance. In conclusion, we propose a novel GAN-based deep learning architecture for contact map prediction, which can efficiently improve the overall performance and serves as an alternative tool for contact map prediction.