Constraint Guided Neighbour Generation for Protein Structure Prediction

Protein structure prediction (PSP) is essential for drug discovery. PSP involves minimising an unknown scoring function over an astronomical search space. PSP has achieved significant progress recently via end-to-end deep learning models that require enormous computational resources and almost all known proteins as training data. In this paper, we develop conformational search methods for PSP based on scoring functions involving geometric constraints learnt by deep learning models. When machine learning models achieve generality and thus lose accuracy, conformational search methods could perform protein-specific fine tuning of the predicted conformations. However, effective conformational sampling in PSP remains a key challenge. Existing conformational search algorithms adopt random selection approaches for neighbour generation and thus greatly depend on luck. We propose a new approach to analyse geometric constraint-based scores, to identify the regions of the current conformations causing inferior scores, and to alter the identified regions to generate neighbour conformations. From an artificial intelligence perspective, our approach prefers informed decisions to random selections. The proposed method also provides promising search guidance as it obtains significant improvements from given initial conformations. Our approach significantly outperforms state-of-the-art PSP search algorithms that use random sampling with a similar scoring function on a set of benchmark proteins of varying types and sizes. Our sample generation approach could be used in other bioinformatics research areas requiring search.


I. INTRODUCTION
Proteins are sequences of amino acid (AA) residues. Proteins fold into three dimensional structures. A protein's AA sequence essentially determines its native structure, which has the lowest free energy, and the native structure essentially determines its function. By docking on a disease protein's native structure, drug molecules inhibit its functions. Protein structure prediction (PSP) by in vitro methods is time consuming, costly, and failure prone. Computational PSP approaches minimise unknown scoring functions over astronomical search spaces and find decoy structures.
PSP has achieved significant progress recently via AlphaFold2's end-to-end deep learning models [1]. However, AlphaFold2 needs enormous computational resources and uses almost all known proteins in its training. Moreover, AlphaFold2's algorithmic details are not open. Although its trained model is available, most research labs cannot run it locally because of its computational resource requirements. Its Google Colab interface provides only restricted access to AlphaFold2. So for further scientific advancement, the immediate challenge to the PSP community is to obtain at least AlphaFold2's accuracy level but using simpler and more efficient PSP methods that depend on fewer training proteins. Further, alternative methods that are based on conformational search approaches could also be investigated. Furthermore, PSP methods should be made available. In this paper, we investigate conformational search methods for PSP using proxy energy functions, i.e. scoring functions based on geometric constraints learnt by deep learning models [2]-[4].
PSP search methods include Monte Carlo algorithms [5], evolutionary algorithms [6], multi-objective optimisation [7], sequential search [8], differential evolution [9], memetic algorithms [10], [11], and gradient descent algorithms [2], [4], [12]. In general, these iterative search algorithms generate neighbour conformations, i.e. three dimensional structures, randomly from the current conformations, evaluate the neighbour conformations using chosen scoring functions, and select the best neighbour conformations as the current conformations for the next iterations. As such, the conformation evaluation phase only indirectly guides the search while the conformation generation phase largely remains unguided and dependent on luck.
From an artificial intelligence perspective, our motivation is to generate neighbour conformations based on informed decisions. So we detect problematic parts of the current conformations and make changes mainly in those identified parts. To detect the problematic parts, we use a constraint-guided approach that helps analyse unsatisfied geometric constraints causing inferior scores of the current conformations. Our approach is simple and it explains the selection decisions made by the neighbourhood generation procedure. To the best of our knowledge, this is the first attempt at taking an informed approach in neighbour generation for PSP. The proposed strategy could be useful in other bioinformatics search problems that include structure based drug design.
We evaluate our constraint guided neighbour generation approach within a simple local search framework. For protein structure representation, we use dihedral angles but also compute Cartesian coordinates of the atoms. Moreover, we consider only the main chains or the backbones of the protein structures and exclude the side chains of the amino acid residues. For protein structure evaluation, we use scoring functions based on predicted residue-residue distances. Our algorithm has been implemented on our newly developed constraint-based PSP search platform Koala. We use a set of benchmark proteins of varying types and sizes. Experimental results show that our constraint-based neighbour generation approach significantly outperforms other random-based PSP search approaches. Our proposed approach also significantly outperforms state-of-the-art PSP search algorithms that use random sampling with similar scoring functions.
The rest of the paper details the problem formulation, the idea/implementation, experimental results, and conclusions.

II. RELATED WORK
Considering the relevance to this work, we mainly explore the search and optimisation approaches for PSP.
The kinetic energy of a protein has not been precisely known or defined so far, but physical (van der Waals forces), chemical (bond energies), and electrostatic (Coulomb forces) energy components have been used in protein structure scoring functions based on molecular dynamics e.g. in CHARMM [13]. Another such successful scoring function used in PSP research is the ROSETTA [14] energy function. Nevertheless, energy functions that involve all-atomic details are computationally very expensive. Note that the energy value is to be computed for each conformation generated during search.
Quark [5] constructs structures using fragment assembly, refines them using replica-exchange Monte Carlo simulations, and uses a composite knowledge-based force field. Quark's force field has eleven terms that include atomic-level, residue-level, and topology-level terms. These terms are knowledge based but also have a direct physical basis.
Differential evolution (DE) has been considered very effective in the PSP problem. An underestimation-assisted global and local cooperative DE (GLCDE) improves the search capability of DE [6]. In GLCDE, the global phase tries to locate promising regions quickly whereas the local phase serves as a local search for improving convergence. To underestimate the objective function, GLCDE uses abstract convexity theory to design an adaptive underestimation model in which the slope control factor of the supporting vectors is dynamically updated based on the evaluated trial individual. AIMOES [7] is a multi-objective optimisation technique which reuses past search experiences carried by a decision maker to select representative solutions. It includes three different physical energy terms: bond energy, non-bond energy, and solvent accessible surface area. MODE-K [9] presents a multi-objective differential evolution algorithm and maintains an archive of optimal solutions. MODE-K uses RWplus [15] as the energy function and decomposes the energy function into two terms to get multiple objectives: a distance-dependent energy term and an orientation-dependent term.
Sequential search is used in SAINT2 [8], in which an independent fragment-assembly structure predictor predicts both sequential and non-sequential structures. SAINT2 uses a combination of knowledge based and physical potentials as energy functions.
PSP search methods also include memetic algorithms. A knowledge based memetic algorithm [11] shows that the Angle Probability List strategy is quite useful for identifying distinct structural patterns. As an energy function, it uses ROSETTA [14] and solvent accessible surface area (SASA) [16].
trRosetta [2], [12] claims that the gradient descent algorithm is useful in solving the PSP problem. As its energy function, trRosetta uses the ROSETTA [14] energy function.
As alternatives to scoring functions based on molecular dynamics, knowledge based scoring functions obtained by machine learning algorithms have also been used in PSP search e.g. residue-residue distance maps and residue-residue contact maps (whether residue-residue distances are within 8Å). In distance and contact maps, residues are represented by Cβ atoms except by Cα for Glycine. SPOT-Contact [17] is a recent contact map prediction method and CONFOLD [18], [19], MULTICOM [20], and CGLFOLD [3] are recent methods that use contact maps in PSP search. Recently, residue-residue distance map based scoring functions have shown promise [21], [22]. RaptorX [23]-[25] and AlphaFold [26] predict distance maps and use them in their search algorithms for protein structures.

III. PROBLEM FORMULATION
Each instance of an AA in a protein is a residue. One residue's C atom is connected with another residue's N atom to form a peptide bond. Thus, we get the main chain or the backbone of a protein. Besides the main chain, each AA except Glycine has a unique side chain starting from the Cα atom, and Cβ is the first atom in a side chain. Assuming standard bond distances and angles, the main chain of a protein can be represented by three rotatable dihedral angles ϕ, ψ, and ω that allow folding. These three angles are respectively defined by each four successive atoms from the sequence of main-chain atoms.
For most proteins, ω is 180° [27], but ϕ and ψ can take any values from −180° to +180°. The side chains at individual AAs have their own dihedral angles, but in this work, we mainly focus on searching for backbone ϕ and ψ angles of the main chain. Using the backbone ϕ and ψ angles found, one can first obtain the main chain and can later deal with the side chains to get the full protein structure. FIGURE 1 (middle) shows the backbone of an entire protein when folded into a three dimensional shape. FIGURE 1 (right) shows that protein structures exhibit certain local flexible and rigid regions that are called secondary structures (SS). Rigid regions such as helices and sheets are comparatively easier to model since most residues in these regions have been observed to take ϕ and ψ values from very narrow ranges of about 20°. Finding the ϕ and ψ values for the flexible loop regions is challenging since they can take any fractional values in [−180°, +180°]. About 40% of the residues in a protein are in loops [28] and loop sampling methods strive to find dihedral angles to model them.
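As an illustration of how a backbone dihedral angle can be measured from Cartesian coordinates, the following minimal Python sketch computes the signed dihedral defined by four successive points. The function name `dihedral_deg` and the toy coordinates are hypothetical. The usual convention, stated here as background rather than taken from this paper, is that ϕ uses the atoms C (previous residue), N, Cα, C and ψ uses N, Cα, C, N (next residue).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four successive 3D points."""
    u1, u2, u3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(u1, u2)  # normal of the plane through p0, p1, p2
    n2 = _cross(u2, u3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(u2, u2)) * _dot(u1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy points only: a quarter turn around the central bond.
print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

A planar trans arrangement of the four points gives a magnitude of 180°, which matches the value quoted above for ω.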
The energy function of a protein is not known precisely. Physical (van der Waals forces), chemical (bond energies), and electrostatic (Coulomb forces) components are used in scoring functions based on molecular dynamics e.g. CHARMM [13] and ROSETTA [14]. These scoring functions involve all-atomic details and are computationally very expensive. Note that the scoring function is to be computed for each conformation generated during search. As alternatives to scoring functions based on molecular dynamics, knowledge based scoring functions obtained by machine learning algorithms have been used in PSP search e.g. residue-residue distance maps [4], [26], contact maps [3], [17], [20], and angle orientations [2]. In any of these residue-residue maps, residues are represented by Cβ atoms except by Cα for Glycine. Further, in contact maps, contacts denote whether residue-residue distances are within 8Å.
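The representative-atom rule and the 8Å contact definition above can be sketched as follows. This is only an illustrative sketch: the function names, the `(residue_name, atoms)` input layout, and the choice of an inclusive threshold (≤ 8Å) are assumptions, not details given in the text.

```python
import math


def representative_atom(res_name):
    """Cβ represents a residue, except Glycine, which is represented by Cα."""
    return "CA" if res_name == "GLY" else "CB"


def contact_map(residues, threshold=8.0):
    """Boolean contact map from residue coordinates.

    residues: list of (res_name, {atom_name: (x, y, z)}) tuples (assumed layout).
    """
    points = [atoms[representative_atom(name)] for name, atoms in residues]
    n = len(points)
    return [[i != j and math.dist(points[i], points[j]) <= threshold
             for j in range(n)] for i in range(n)]


# Toy coordinates, for illustration only.
toy = [
    ("ALA", {"CA": (0.0, 0.0, 0.0), "CB": (1.0, 0.0, 0.0)}),
    ("GLY", {"CA": (6.0, 0.0, 0.0)}),
    ("ALA", {"CA": (20.0, 0.0, 0.0), "CB": (21.0, 0.0, 0.0)}),
]
print(contact_map(toy))
```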
To evaluate our constraint-guided neighbour generation approach in PSP, in this work, we perform loop sampling with a residue-residue distance or contact map based scoring function. However, our neighbour generation approach could also be used to refine models for helices and sheets.

IV. IDEA ILLUSTRATION
Let i and j be two residues of a protein and $d_{ij}$ be the prediction made by a given machine learning algorithm about the distance between residues i and j in the native structure of the protein. Also, let c be the current conformation of the given protein during search and $d^{c}_{ij}$ be the current distance between residues i and j in the conformation c. FIGURE 2 (left) shows that $d_{ij} = 6.5$ acts as a constraint and FIGURE 2 (right) shows that $d^{c}_{ij} = 9.25$ violates the constraint. To bring residues i and j closer to each other and thus to satisfy the constraint, we change the dihedral angles of a residue k, which is in between residues i and j.
FIGURE 2. Left: distance between two helix residues i and j as predicted by a machine learning algorithm. Right: changing the dihedral angles of a loop residue k changes the distance between i and j to obtain the predicted distance.
During search, in each iteration, we heuristically choose residues i and j with $d^{c}_{ij}$ the furthest from $d_{ij}$. Moreover, we randomly choose k from loop residues only. Note that we perform loop sampling in this work, leaving helices and sheets the same after first construction in initialisation.
Our key contribution in this paper is selection of k explicitly based on its potential to change the distance between residues i and j, which are in violation of the predicted distance constraint. Existing algorithms basically randomly select residue k without having any explicit knowledge of i and j and hence could waste the search effort.
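The selection idea in this section can be sketched in a few lines of Python. This is a simplified illustration, not the implementation in Koala: the function name, the dictionary inputs, and the rule of skipping pairs that have no loop residue between them are assumptions made for the sketch.

```python
import random


def pick_pair_and_loop_residue(predicted, current, loop_residues, rng):
    """Choose the most violated pair (i, j), then a random loop residue k between them.

    predicted, current: dicts mapping (i, j) to predicted and current distances.
    loop_residues: set of residue indices that belong to loops.
    Returns ((i, j), k), or None when no pair encloses a loop residue.
    """
    best = None
    for pair, d_pred in predicted.items():
        lo, hi = min(pair), max(pair)
        between = sorted(k for k in loop_residues if lo < k < hi)
        if not between:
            continue  # assumption: a pair is usable only if a loop residue lies between
        violation = abs(current[pair] - d_pred)
        if best is None or violation > best[0]:
            best = (violation, pair, between)
    if best is None:
        return None
    _, pair, between = best
    return pair, rng.choice(between)


# Toy data reusing the distances of the FIGURE 2 example (6.5 predicted, 9.25 current).
predicted = {(2, 20): 6.5, (5, 30): 10.0}
current = {(2, 20): 9.25, (5, 30): 10.5}
print(pick_pair_and_loop_residue(predicted, current, {8, 9, 25}, random.Random(0)))
```

The point of the sketch is the ordering of decisions: the pair is chosen from the constraint violations first, and only then is k drawn, so the change is always applied where it can affect the violated distance.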

V. IMPLEMENTATION DETAILS
FIGURE 3 shows our PSP pipeline. From the AA sequence of a protein, we first use machine learning approaches to predict the main-chain angles, residue-residue distance or contact maps, and secondary structures. Using the predictions in various ways, we then adopt optimisation search approaches to perform conformation initialisation and evaluation, and then in an iterative fashion generate and evaluate conformation neighbourhood and accept the best neighbour conformation as the current conformation for the next iteration. We describe each stage of our pipeline. However, our main contribution is in the optimisation search approach, more particularly in the neighbourhood generation step.
VOLUME 4, 2016

A. USING MACHINE LEARNING ALGORITHMS
There are several main-chain angle prediction methods e.g. SPIDER2 [29], SPOT-1D [39], and SAP [30]. SPIDER2 predicts with mean-absolute errors (MAE) of about 19.7 and 30.3 degrees for ϕ and ψ angles respectively. For SAP, MAE values are 15.66 and 18.59 degrees respectively and for SPOT-1D, 16 and 23 degrees respectively. After some preliminary experiments, we have found that SPOT-1D predictions led to better three dimensional structures, mainly because it captures better overall shape than local structures.
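For reference, an MAE over dihedral angles is commonly computed on the circle, so that 179° and −179° are 2° apart rather than 358°. The sketch below assumes that periodic convention; whether each cited method reports its MAE in exactly this way is not stated in the text, and the function names are hypothetical.

```python
def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def angle_mae(preds, trues):
    """Mean absolute angular error over paired predictions and true values."""
    return sum(angular_error(p, t) for p, t in zip(preds, trues)) / len(preds)


# Toy values: errors of 2 and 10 degrees give an MAE of 6 degrees.
print(angle_mae([179.0, -60.0], [-179.0, -50.0]))
```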
Many inter-residue distance prediction algorithms have been proposed in the last few years. These include RaptorX [4], PDNET [31], and DeepDist [32]. PDNET and DeepDist both have MAE values of 4.1Å whereas RaptorX [4] has an MAE of less than 4Å. So we chose RaptorX over the others as it has the lowest MAE. RaptorX predicts distances for residue pairs that have at least 12 other residues in between them in the sequence and that have predicted distances less than or equal to 15Å.
For residue-residue contact map prediction, there are several methods, e.g. SPOT-Contact [17], RaptorX-Contact [33], and Dncon2 [34]. Dncon2 obtains less than 70% precision, RaptorX-Contact achieves less than 80% precision, and SPOT-Contact gets more than 80% precision for top L/10 predictions for short, medium and long-range contacts, where L denotes the number of residues in the protein. So we use SPOT-Contact for contact prediction.
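The top L/10 precision mentioned above can be illustrated with a short sketch: rank the predicted pairs by probability, keep the top L/10, and count how many are true contacts. The function name and the toy data are hypothetical, and the separate treatment of short, medium, and long-range pairs is omitted here.

```python
def top_fraction_precision(pred_probs, true_contacts, length, divisor=10):
    """Precision of the top length/divisor predicted contacts.

    pred_probs: dict mapping (i, j) to predicted contact probability.
    true_contacts: set of (i, j) pairs that are in contact in the native structure.
    """
    top_n = max(1, length // divisor)
    ranked = sorted(pred_probs, key=pred_probs.get, reverse=True)[:top_n]
    if not ranked:
        return 0.0
    hits = sum(1 for pair in ranked if pair in true_contacts)
    return hits / len(ranked)


# Toy example with L = 30, so the top 3 predictions are scored.
probs = {(1, 10): 0.9, (2, 20): 0.8, (3, 25): 0.7, (4, 28): 0.6}
native = {(1, 10), (3, 25), (4, 28)}
print(top_fraction_precision(probs, native, length=30))
```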
For secondary structure prediction, a few methods have been proposed: PSIPred [35], DISTILL [36], and SSpro8 [37]. SSpro8 achieves the highest accuracy levels of about 92% and 79% respectively with and without using homologous proteins. So we choose SSpro8 for secondary structure prediction.

B. USING OPTIMISATION SEARCH ALGORITHMS
We describe conformation representation, generation, and evaluation along with scoring functions used in our search.
We mainly use distance maps in scoring functions, but we also experiment with contact maps. We describe our scoring functions below where numeric parameters are fixed after preliminary experiments, but for the sake of brevity, we do not show those results. Note scores are not defined for sequentially proximate residues i and j with $|i - j| < 3$.
For a pair of residues i and j, RaptorX [4] provides a predicted distance $d_{ij}$ and a deviation $\delta_{ij}$ in the prediction. Using these, we define minimum and maximum allowable distances $m_{ij} = d_{ij} - \delta_{ij}$ and $M_{ij} = d_{ij} + \delta_{ij}$, and relative error $r_{ij} = \delta_{ij} / d_{ij}$. We do not include residue pairs for which relative errors are 0.5 or more. Next, for a current conformation c, we define a partial score $s^{c}_{ij}$ and the total score $s^{c}$ as below. FIGURE 4 (left) shows our distance map based scoring function $s^{c}_{ij}$ with $m_{ij} = 5$ and $M_{ij} = 7$ for any residue pair i and j. As we see, the lower bound of the score for a residue pair is −1.
For a pair of residues i and j, SPOT-Contact [17] provides a predicted probability $p_{ij}$ for the residue pair to be in contact. We consider residue pairs with contact probabilities of at least 0.3. We give more emphasis to a greater probability. Moreover, we consider two residues to be in contact when their distance is in between a minimum distance d = 3.8Å and a maximum distance D = 8.0Å. Next, for a current conformation c, we define a partial score $\sigma^{c}_{ij}$ and the total score $\sigma^{c}$ as below. FIGURE 4 (middle) shows our contact map based scoring function $\sigma^{c}_{ij}$ with d = 3.8, D = 8.0, and $p_{ij} = 0.7$ for any residue pair i and j. As we see, the lower bound of the score for a pair of residues with indexes i and j is $-p_{ij}$. Note our scoring function is somewhat similar to the bounded potential [14] and the square well [38] functions.
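The bound definitions and the two filters stated above (sequence separation below 3 and relative error of 0.5 or more) can be sketched as follows. The function name and input layout are assumptions. The sketch deliberately stops at the bounds and does not attempt to reproduce the scoring function itself.

```python
def usable_constraints(predictions, min_separation=3, max_relative_error=0.5):
    """Turn predicted distances into allowable intervals [m_ij, M_ij].

    predictions: dict mapping (i, j) to (d_ij, delta_ij).
    Pairs with |i - j| < min_separation or with relative error
    delta_ij / d_ij >= max_relative_error are left out.
    """
    bounds = {}
    for (i, j), (d, delta) in predictions.items():
        if abs(i - j) < min_separation:
            continue
        if delta / d >= max_relative_error:
            continue
        bounds[(i, j)] = (d - delta, d + delta)
    return bounds


# Toy predictions: only the second pair passes both filters.
toy = {(1, 2): (3.8, 0.2), (1, 10): (6.0, 1.0), (4, 20): (10.0, 6.0)}
print(usable_constraints(toy))
```

With $d_{ij} = 6$ and $\delta_{ij} = 1$, the interval is [5, 7], which is the example drawn in FIGURE 4 (left).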
Our distance or contact based scoring functions do not include all residue pairs. So to avoid steric clashes between residues, we define another scoring function. For this, we consider a clash when residue pairs have Cα atoms within Θ = 3.6Å of each other. For a current conformation c, we define a partial score $\chi^{c}_{ij}$ and the total score $\chi^{c}$ as below. FIGURE 4 (right) shows our steric clash based scoring function $\chi^{c}_{ij}$ with Θ = 3.6 for a given residue pair with indexes i and j.
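The clash condition above amounts to a pairwise distance check on Cα atoms. The sketch below only lists the clashing pairs; it is not the clash score $\chi^{c}$ itself. The function name is hypothetical and the strict inequality is an assumption.

```python
import math


def clashing_pairs(ca_coords, theta=3.6):
    """Residue index pairs whose Cα atoms are closer than theta Å."""
    n = len(ca_coords)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if math.dist(ca_coords[i], ca_coords[j]) < theta]


# Toy chain: three residues spaced 3.8 Å apart plus one residue folded back onto the start.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(clashing_pairs(coords))
```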
We primarily represent a conformation by the ϕ and ψ values of the residues. However, to use distance or contact maps, we also compute coordinates but only for N, Cα, C, and Cβ atoms of each residue. During search when ϕ or ψ values are changed, to generate neighbour conformations, we recompute the coordinates of the atoms that will be affected by the changes.
Algorithm 1 shows the pseudocode of our simple local search algorithm for PSP. In this algorithm, geometric constraints learnt by machine learning algorithms have been turned into objective functions via the scoring functions. Local search algorithms usually generate neighbours randomly or in a generic way unrelated to the specific problem. Constraint guided sampling embedded within local search provides problem specific knowledge coded as constraints. Algorithm 1 uses the distance map based scoring function $s^{c}$, but $s^{c}$ could be easily replaced by the contact map based scoring function $\sigma^{c}$. We further discuss the details of the algorithm.
In Algorithm 1 Line 1, we take predicted ϕ and ψ values and MAE values $\phi_{\mathrm{MAE}}$ and $\psi_{\mathrm{MAE}}$ from SPOT-1D [39]. We then generate random values from ranges $[\phi - \phi_{\mathrm{MAE}}, \phi + \phi_{\mathrm{MAE}}]$ and $[\psi - \psi_{\mathrm{MAE}}, \psi + \psi_{\mathrm{MAE}}]$. We also consider an alternative initialisation procedure: we use SSpro8 [37] predicted secondary structures and generate ϕ and ψ values randomly from the ranges shown in TABLE 1. Note that once initialised, dihedral angles of only loop residues are changed by search. This is because SPOT-1D predictions for helix and sheet residues have smaller errors than those for loops.
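The first initialisation procedure can be sketched as below. The uniform distribution and the wrap-around to [−180°, +180°) are assumptions of this sketch, since the text only says that random values are generated from the two ranges; the function names are hypothetical.

```python
import random


def wrap(angle):
    """Map an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def initial_angles(predicted, mae, rng):
    """Sample starting (phi, psi) pairs around the predicted values.

    predicted: list of (phi, psi) predictions in degrees.
    mae: (phi_mae, psi_mae) in degrees.
    """
    phi_mae, psi_mae = mae
    return [(wrap(rng.uniform(phi - phi_mae, phi + phi_mae)),
             wrap(rng.uniform(psi - psi_mae, psi + psi_mae)))
            for phi, psi in predicted]


# Toy predictions with the SPOT-1D MAE values quoted in the text (16 and 23 degrees).
print(initial_angles([(-60.0, -45.0), (-120.0, 130.0)], (16.0, 23.0), random.Random(1)))
```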
Algorithm 1
7: MinClash ← $(\chi^{c} > \chi_{t})$ ∧ probability(0.5) // $\chi_{t} = 2$
8: if MinClash then // minimise clash based score
9:   ⟨i, j⟩ ← $\arg\max_{\langle i', j' \rangle : \neg \mathrm{tabu}(\langle i', j' \rangle, \tau)} \chi^{c}_{i'j'}$
10: else // minimise distance based score
11:   ⟨i, j⟩ ← $\arg\max_{\langle i', j' \rangle : \neg \mathrm{tabu}(\langle i', j' \rangle, \tau)} s^{c}_{i'j'}$
14: ∆ϕ ← N random values from ∆Φ // N = 20
15: ∆ψ ← N random values from ∆Ψ // N = 20
16: C ← {$c_{n}$ : add ∆ϕ[n] to $\phi_{k}$, ∆ψ[n] to $\psi_{k}$ in c}
17: Evaluate each $c_{n} \in C$ by computing scores $s^{c_{n}}$, $\chi^{c_{n}}$
18: if MinClash then // minimise clash based score
19:   c′ ← $\arg\min_{c_{n}} \chi^{c_{n}}$
20: else // minimise distance map based score
22:   c′ ← $\arg\min_{c_{n}} s^{c_{n}}$

In Algorithm 1 Lines 2 and 17, we evaluate a conformation c by computing the distance map based score $s^{c}$ and the steric clash based score $\chi^{c}$. We do not add $s^{c}$ and $\chi^{c}$, since their normalisation is not straightforward. Consequently, we have a two-objective minimisation problem, where $s^{c}$ is the primary objective. However, at a time, we mainly work with one objective function, which is chosen in Line 7 in Algorithm 1. If $\chi^{c}$ is more than a threshold $\chi_{t} = 2$, with 50% probability, we minimise $\chi^{c}$; otherwise, we minimise $s^{c}$.
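The objective choice made in Line 7 can be written as a small Python helper. This is a sketch under the stated threshold and probability; the function name and the string labels are hypothetical.

```python
import random


def choose_objective(clash_score, rng, clash_threshold=2.0, probability=0.5):
    """Pick the score to minimise in the current iteration.

    The clash score is chosen only when it exceeds the threshold and a
    biased coin flip succeeds; otherwise the distance map score is used.
    """
    min_clash = clash_score > clash_threshold and rng.random() < probability
    return "clash" if min_clash else "distance"


rng = random.Random(0)
print([choose_objective(5.0, rng) for _ in range(6)])
print(choose_objective(1.0, rng))
```

Because the coin flip is made only when the clash score is above the threshold, the distance map score remains the objective in every iteration where clashes are already low.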
Using the scoring function selected in Line 7 in Algorithm 1, we choose a residue pair i and j with the worst score (the tabu condition is discussed later) in Lines 9 and 11. In Line 13, we then choose a random loop residue k, which is in between i and j. Next, in Lines 14 and 15, we choose N angle differences in each of ∆ϕ and ∆ψ respectively from sets ∆Φ and ∆Ψ. Note ∆Φ and ∆Ψ as defined in Lines 3 and 4 hold values in intervals of 3° from ranges $[-\phi_{\mathrm{MAE}}, +\phi_{\mathrm{MAE}}]$ and $[-\psi_{\mathrm{MAE}}, +\psi_{\mathrm{MAE}}]$ respectively. Then, in Line 16, we generate N neighbour conformations using the angle differences in ∆ϕ and ∆ψ.
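Lines 14 to 16 can be sketched as follows. Two details are assumptions of this sketch rather than statements from the text: the 3° grid is taken to consist of multiples of 3 centred on 0, and the N values are drawn with replacement (the grid for $\phi_{\mathrm{MAE}} = 16$ then has only 11 values while N = 20). The function names are hypothetical.

```python
import random


def wrap(angle):
    """Map an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def delta_grid(mae, step=3):
    """Multiples of `step` degrees that lie inside [-mae, +mae]."""
    limit = int(mae // step) * step
    return list(range(-limit, limit + 1, step))


def generate_neighbours(angles, k, phi_mae, psi_mae, n, rng):
    """Create n neighbours that differ from `angles` only at residue k."""
    d_phi = rng.choices(delta_grid(phi_mae), k=n)
    d_psi = rng.choices(delta_grid(psi_mae), k=n)
    neighbours = []
    for dp, ds in zip(d_phi, d_psi):
        new = list(angles)
        phi, psi = angles[k]
        new[k] = (wrap(phi + dp), wrap(psi + ds))
        neighbours.append(new)
    return neighbours


base = [(-60.0, -45.0), (-90.0, 120.0), (-120.0, 130.0)]
for neighbour in generate_neighbours(base, 1, 16, 23, 3, random.Random(0)):
    print(neighbour[1])
```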
Revisitation is a problematic issue in local search. In Algorithm 1 Lines 9 and 11, the same i and j could be repeatedly selected. To avoid revisitation, we use the tabu metaheuristic [41]. With tabu initialised in Line 5, enforced in Line 26, and checked via tabu(⟨i′, j′⟩, τ) in Lines 9 and 11, recently selected i and j will not be selected again in Lines 9 and 11 within a number (called tabu tenure T) of future iterations. In this work, we do not apply tabu on the selection of residue k in Line 13.
j: Accepting Best Neighbour. When we improve one objective function, we do not want to worsen the other one. Moreover, improving the partial score $s^{c'}_{ij}$ is the primary reason to select the residues i and j in Line 11. So when the distance map based scoring function $s^{c}$ is chosen in Line 7, we accept neighbour c′ with the best $s^{c'}$ in Line 24 if the partial score $s^{c'}_{ij}$ is strictly better than $s^{c}_{ij}$. We also alternatively accept c′ when $s^{c'}$ is strictly better than $s^{c}$ and $\chi^{c'}$ is not worse than $\chi^{c}$. Next, steric clash minimisation is basically a secondary objective. So when the steric clash based score $\chi^{c}$ is chosen in Line 7, we accept neighbour c′ with the best $\chi^{c'}$ in Line 20 if $\chi^{c'}$ is strictly better than $\chi^{c}$ and $s^{c'}$ is not worse than $s^{c}$.
k: Implementation Platform. We implement our algorithms on top of a recently developed Python-based PSP search platform named Koala, which draws concepts from a constraint based local search system named Kangaroo [42].
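The tabu check and the acceptance rule described in this section can be sketched in Python as below. These helpers are illustrative and are not taken from Koala. The names, the dictionary used as tabu memory, and the exact boundary of the tenure window are assumptions. Lower scores are treated as better, since both objectives are minimised.

```python
def is_tabu(pair, last_selected, iteration, tenure=15):
    """True if `pair` was selected within the last `tenure` iterations.

    last_selected: dict mapping a residue pair to the iteration of its last selection.
    """
    return pair in last_selected and iteration - last_selected[pair] <= tenure


def accept(min_clash, s_cur, s_new, chi_cur, chi_new, s_ij_cur, s_ij_new):
    """Decide whether the best neighbour replaces the current conformation."""
    if min_clash:
        # Clash objective: clash score must improve, distance score must not worsen.
        return chi_new < chi_cur and s_new <= s_cur
    # Distance objective: either the selected pair's partial score improves, or the
    # total distance score improves without worsening the clash score.
    return s_ij_new < s_ij_cur or (s_new < s_cur and chi_new <= chi_cur)


memory = {(2, 20): 10}
print(is_tabu((2, 20), memory, iteration=12), is_tabu((2, 20), memory, iteration=26))
print(accept(False, -10.0, -9.0, 0.0, 0.0, 0.5, 0.2))
```

Note that under the distance objective a neighbour can be accepted even when the total score gets worse, as long as the partial score of the selected pair improves; the last line of the example exercises that case.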

VI. EXPERIMENTAL RESULTS
All the algorithms are executed on a Linux 64-bit system with Intel® Xeon® X3470 2.93 GHz × 8 and 8GB memory.

A. DATASET
Our dataset includes 14 α type, 11 β type, and 10 α/β type proteins. These proteins are from existing PSP search algorithms such as QUARK [5], MODE-K [9], and MOD-CSA/CA [43] or a machine learning algorithm such as SPOT-1D [39]. We have used CD-HIT and PSI-BLAST [44] to ensure the proteins do not have more than 25% sequence similarity with the training proteins of the previously-mentioned machine learning algorithms used in our implementation.

B. COMPARISON OF OUR ALGORITHM VERSIONS
Besides the steric clash based scoring function χ c , Algorithm 1 uses (i) distance map based scoring function s c , (ii) tabu with tenure 15, (iii) initialisation using predictions from SPOT-1D [39], (iv) selection of residue pairs i and j based on scoring functions χ c ij or s c ij , and (v) generation of ∆ϕ and ∆ψ values from the ranges determined by the MAE values of SPOT-1D. To test the effectiveness of each of the components mentioned, we create the following 7 versions of the proposed algorithm.
dm: is the exact algorithm described in Algorithm 1 with the 5 components mentioned above.
cm: uses the contact map based scoring function $\sigma^{c}$ instead of the distance map based scoring function $s^{c}$ in dm.
nt: does not use the tabu metaheuristic used in dm and so more revisitation of selected residue pairs could occur.
rp: selects residue pairs i and j randomly but still satisfying the condition $|i - j| \geq 3 \wedge r_{ij} < 0.5$ as is needed in the definition of the distance map based scoring function.
rl: randomly selects a loop region first and then a random residue k from that loop. Note dm first selects residue pairs i and j using chosen scoring functions and then selects a loop residue in between residues i and j.
ri: unlike dm, initialises the ϕ and ψ angles randomly but using the SS specific angle ranges shown in TABLE 1.
fr: like ri, initialises the ϕ and ψ angles randomly from the SS specific angle ranges and, unlike dm, generates ∆ϕ and ∆ψ values from the full range.
We run each of the 7 versions of our algorithm on each protein 5 times. Each run has the maximum iteration M = 8000 and the number of neighbours generated in each iteration N = 20. So each run essentially explores 160,000 conformations; this is the same number of conformations explored by CGLFOLD [3]. From each run, we take the 5 best conformations in terms of the respective distance or contact map based scoring function used. Then, we compute the mean Root Mean Square Deviation (RMSD) value over the 25 conformations for each protein for the same algorithm version and show it in TABLE 2 (top left). Among our 7 versions, as we see in TABLE 2 (bottom left), dm obtains the best mean RMSD values in 18 out of 35 proteins and the 2nd best mean RMSD values in 9 proteins. We perform the Wilcoxon signed rank test at the 95% confidence level on dm against the other 6 versions and the p-values are at most 0.0008. This indicates dm's performance is statistically significantly different from the other versions.
Moreover, TABLE 2 (bottom) also shows the numbers of proteins in which various versions obtain mean RMSD values ≤ various threshold values such as 6Å, 9Å, and 12Å. Clearly, dm obtains the best performance among the versions particularly with thresholds 6Å and 9Å. From these results, it is clear that each component of dm is important for its performance. We will perform further analysis later in the paper. Henceforth, we name our best algorithm version dm as Constraint Guided Neighbours for PSP (CGNP).
TABLE 2. Top: mean RMSD values obtained for proteins (left) by proposed algorithm variants (center) and state-of-the-art algorithms (right). Bottom: the number of proteins with mean RMSD values the best (emboldened) and the 2nd best (underlined), and also the number of proteins with mean RMSD values ≤ various threshold levels when our algorithm variants are compared with each other and when our best version is compared with the state-of-the-art algorithms. Note that CGNP is actually dm.
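The threshold counts reported in TABLE 2 (bottom) follow from a simple tally over per-protein mean RMSD values, as in this sketch. The protein labels and RMSD numbers below are invented for illustration and are not results from this paper; the function name is hypothetical.

```python
def count_within(mean_rmsd, thresholds=(6.0, 9.0, 12.0)):
    """Number of proteins whose mean RMSD is at most each threshold (in Å)."""
    return {t: sum(1 for value in mean_rmsd.values() if value <= t)
            for t in thresholds}


# Made-up values, for illustration only.
toy = {"protein_a": 5.2, "protein_b": 8.9, "protein_c": 12.0, "protein_d": 14.1}
print(count_within(toy))
```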

C. COMPARISON WITH STATE-OF-THE-ART METHODS
We compare our proposed CGNP with the two most closely related recent PSP search methods, CGLFOLD [3] and trRosettaX [12]. CGLFOLD performs perturbation based loop sampling along with a predicted contact map based scoring function. On the other hand, trRosettaX performs gradient minimisation along with a scoring function that has components based on predicted distance maps and inter-residue angle orientations. Of course both CGLFOLD and trRosettaX randomly generate neighbour conformations.
For running CGLFOLD on the proteins and computing mean RMSD values, we use the same setting that we have used in the experiments with our algorithm versions. For trRosettaX, we use only the distance based component in the scoring function, since CGNP uses distance based scores. Note that our main objective in this paper is not to explore scoring functions but rather to see the effectiveness of our constraint-guided neighbour generation approach over existing random-based approaches. However, to investigate that, we do need a scoring function and we use distance based ones. Moreover, trRosettaX returns just one conformation per run. So we run trRosettaX on each protein 25 times and compute mean RMSD values over the 25 conformations. We perform the Friedman test with 95% confidence level on CGNP, CGLFOLD, and trRosettaX performances and get p-value 0.0027. Then, for posthoc analysis, we perform the Nemenyi test with 95% confidence level to compute pairwise differences among the three algorithms. From the test results, we see that CGLFOLD and trRosettaX have no statistically significant difference with p-value 0.6046, but CGNP is statistically significantly different from CGLFOLD and trRosettaX with p-values 0.0026 and 0.0444 respectively.
We run all the algorithms on three proteins of three different types and report their running times in TABLE 3. Note that these algorithms have been implemented on different platforms and in different programming languages. For example, our method and CGLFOLD are implemented in Python, which is comparatively slow as a programming language and platform. On the other hand, trRosettaX is implemented in C/C++ and is thus inherently faster.
The search progress figure shows the best distance map based scores obtained so far in each iteration of sample runs of the rp, rl, and dm versions of our algorithm for a sample protein 1IS7. Clearly, dm keeps improving the distance map based scores while rp and rl get somewhat stuck in plateaus in terms of achieving better scores. These results show the effectiveness of our constraint-guided neighbour generation approach of dm over the random selection based approaches of rp and rl in terms of improving the distance map based scores.
FIGURE 9 shows sample RMSD distributions of the conformations generated by the sample runs of the rp, rl, and dm versions of our algorithm for one sample protein 1IS7. These three versions differ in how we select the residue pairs or the loop regions to eventually select another residue whose ϕ and ψ angles will be changed. Clearly, selecting a random loop region by rl is the worst among the three versions as rl explores inferior conformations. Between selecting a random pair by rp and a greedy pair by dm, the greedy pair selection explores more promising conformations in most cases. These results show the effectiveness of our constraint-based conformation generation approach in dm over the random selection based approaches in rp and rl in terms of exploring higher quality conformations.

VII. CASP13 AND CAMEO144 PROTEINS
We have also run our method, with the same experimental setting as described before, on 20 proteins from the CASP13 and CAMEO144 hard target test sets, compared it with a very recent method trRosettaX [12], and reported RMSD and GDT-TS scores. The GDT-TS score has been used in ranking PSP methods that took part in CASP14. For most of the proteins, as shown in TABLE 4, our method achieves better results than trRosettaX.
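For readers unfamiliar with GDT-TS, the score averages the percentages of residues whose Cα atoms lie within 1, 2, 4, and 8Å of their native positions. The sketch below is a simplification: it assumes a single, already computed superposition and takes the per-residue Cα deviations as input, whereas standard GDT-TS tools search over many superpositions to maximise each percentage. The function name is hypothetical.

```python
def gdt_ts(ca_deviations, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS (0 to 100) from per-residue Cα deviations in Å."""
    n = len(ca_deviations)
    fractions = [sum(1 for d in ca_deviations if d <= c) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy deviations: fractions 0.25, 0.5, 0.75, 0.75 average to 56.25.
print(gdt_ts([0.5, 1.5, 3.0, 9.0]))
```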

VIII. CONCLUSIONS
Protein structure prediction (PSP) has achieved significant progress lately via development of geometric constraint based scoring functions. However, sample generation for PSP remains challenging as existing search algorithms take random based approaches. We propose a novel constraint-guided approach to identify problematic parts of a current conformation and then to make changes to those parts to generate neighbour conformations. Our approach thus makes informed decisions in neighbour generation and explains its performance. On a set of benchmark proteins of varying types and sizes, our approach significantly outperforms state-of-the-art PSP search algorithms that use random sampling with similar scoring functions.